I have two arrays, say A and B with |A|=8 and |B|=4. I want to calculate the set difference A-B. How do I proceed? Please note that there are no repeated elements in either of the sets.
Edit: Thank you so much, everybody, for a myriad of elegant solutions. Since I am in the prototyping stage of my project, for now I implemented the simplest solution, suggested by Brian and Owen. But I do appreciate the clever use of data structures suggested by the rest of you, even though I am not a computer scientist but an engineer and never took a data structures course. Looks like it's about time I really start reading CLRS, which I have been procrastinating on for quite a while :) Thanks again!
Sort arrays A and B; the result will be built in C.
Let a = the first element of A and b = the first element of B.
Then repeat:
1) while a < b: insert a into C and set a = next element of A
2) while a > b: set b = next element of B
3) if a = b: set a = next element of A and b = next element of B
4) if B runs out: insert the rest of A into C and stop
5) if A runs out: stop
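For illustration, here is a rough C sketch of the procedure above for sorted int arrays with no duplicates (the function name and signature are my own choices); it returns the number of elements written to C:

int set_difference(const int A[], int na, const int B[], int nb, int C[])
{
    int i = 0, j = 0, k = 0;
    while (i < na && j < nb) {
        if (A[i] < B[j])
            C[k++] = A[i++];   /* step 1: a < b, so a cannot be in B */
        else if (A[i] > B[j])
            j++;               /* step 2: a > b, advance b */
        else {
            i++;               /* step 3: a = b, skip the element in both */
            j++;
        }
    }
    while (i < na)
        C[k++] = A[i++];       /* step 4: B exhausted, copy the rest of A */
    return k;                  /* step 5: A exhausted, stop */
}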
Iterate over each element of A; if it is not in B, add it to a new set C.
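A minimal C sketch of this approach (my own illustration; it is O(|A|·|B|), which is perfectly acceptable at these sizes):

int difference_naive(const int A[], int na, const int B[], int nb, int C[])
{
    int k = 0;
    for (int i = 0; i < na; ++i) {
        int in_b = 0;
        for (int j = 0; j < nb; ++j)
            if (A[i] == B[j]) { in_b = 1; break; }
        if (!in_b)
            C[k++] = A[i];     /* A[i] is not in B: keep it */
    }
    return k;
}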
It depends on how you want to represent your sets, but if they are just packed bits then you can use bitwise operators, e.g. D = A & ~B; would give you the set difference A-B if the sets fit into an integer type. For larger sets you might use arrays of integer types and iterate, e.g.
for (i = 0; i < N; ++i)
{
    D[i] = A[i] & ~B[i];   /* keep the bits set in A but not in B */
}
The following assumes the sets are stored as a sorted container (as std::set does).
There's a common algorithm for merging two ordered lists to produce a third. The idea is that when you look at the heads of the two lists, you can determine which is the lower, extract that, and add it to the tail of the output, then repeat.
There are variants which detect the case where the two heads are equal, and treat this specially. Set intersections and unions are examples of this.
With a set asymmetric difference, the key point is that for A-B, when you extract the head of B, you discard it. When you extract the head of A, you add it to the output unless the head of B is equal, in which case you extract that too and discard both.
Although this approach is designed for sequential-access data structures (and tape storage etc), it's sometimes very useful to do the same thing for a random-access data structure, so long as it's reasonably efficient to access it sequentially anyway. And you don't necessarily have to extract things for real - you can copy and step instead.
The key point is that you step through the inputs sequentially, always looking at the lowest remaining value next, so that (if the inputs have no duplicates) you will encounter matched items at the same time. You therefore always know whether your next lowest value to handle is an item from A with no match in B, an item in B with no match in A, or an item that's equal in both A and B.
More generally, the algorithm for the set difference depends on the representation of the set. For example, if the set is represented as a bit-vector, the above would be overcomplex and slow - you'd just loop through the vectors doing bitwise operations. If the set is represented as a hashtable (as in the tr1 unordered_set) the above is wrong as it requires ordered inputs.
If you have your own binary tree code that you're using for the sets, one good option is to convert both trees into linked lists, work on the lists, then convert the resulting list to a perfectly balanced tree. The linked-list set-difference is very simple, and the two conversions are re-usable for other similar operations.
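As a rough illustration (my own sketch, not from any particular library), here is the linked-list set difference A - B over sorted, duplicate-free lists of ints; it reuses A's nodes to build the result:

typedef struct Node { int value; struct Node *next; } Node;

Node *list_difference(Node *a, Node *b)
{
    Node *head = NULL, **tail = &head;   /* build the result in order */
    while (a) {
        if (!b || a->value < b->value) { /* a's head has no match in B: keep */
            *tail = a;
            tail = &a->next;
            a = a->next;
        } else if (a->value > b->value) {
            b = b->next;                 /* b's head is unmatched: discard */
        } else {
            a = a->next;                 /* equal heads: drop from both */
            b = b->next;
        }
    }
    *tail = NULL;
    return head;
}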
EDIT
On the complexity - using these ordered merge-like algorithms is O(n) provided you can do the in-order traversals in O(n). Converting to a list and back is also O(n) as each of the three steps is O(n) - tree-to-list, set-difference and list-to-tree.
Tree-to-list basically does a depth-first traversal, deconstructing the tree as it goes. There's a trick for making this iterative, storing the "stack" in part-handled nodes - changing a left-child pointer into a parent-pointer just before you step to the left child. This is a good idea if the tree may be large and unbalanced.
Converting a list to a tree basically involves a depth-first traversal of an imaginary tree (based on the size, known from the start) building it for real as you go. If a tree has 5 nodes, for instance, you can say that the root will be node 3. You recurse to build a two-node left subtree, then grab the next item from the list for that root, then recurse to build a two-node right subtree.
The list-to-tree conversion shouldn't need to be implemented iteratively - recursive is fine as the result is always perfectly balanced. If you can't handle the log n recursion depth, you almost certainly can't handle the full tree anyway.
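A small C sketch of that list-to-tree conversion (my own illustration; the node types and names are assumptions). It recurses on the imaginary tree shape implied by the size, consuming list nodes in order as each root is reached:

#include <stdlib.h>

typedef struct ListNode { int value; struct ListNode *next; } ListNode;
typedef struct TreeNode { int value; struct TreeNode *left, *right; } TreeNode;

/* Builds a perfectly balanced BST from the first n nodes of a sorted list.
 * *list is advanced past the consumed nodes. */
TreeNode *build_balanced(ListNode **list, int n)
{
    if (n == 0)
        return NULL;
    TreeNode *node = malloc(sizeof *node);
    node->left = build_balanced(list, n / 2);     /* build the left subtree */
    node->value = (*list)->value;                 /* grab the next list item */
    *list = (*list)->next;
    node->right = build_balanced(list, n - n / 2 - 1);
    return node;
}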
Implement a set object in C. You can do it using a hash table for the underlying storage. This is obviously a non-trivial exercise, but a few open-source solutions exist. Then you simply need to add all the elements of A and then iterate over B and remove any that are elements of your set.
The key point is to use the right data structure for the job.
For larger sets I'd suggest sorting the numbers and iterating through them by emulating the code at http://www.cplusplus.com/reference/algorithm/set_difference/ which would be O(N*logN), but since the set sizes are so small, the solution given by Brian seems fine even though it's theoretically slower at O(N^2).
So, I've implemented a binary search tree backed by an array. The full implementation is here.
Because the tree is backed by an array, I determine left and right children by performing arithmetic on the current index.
private Integer getLeftIdx(Integer rootIndex) {
    return 2 * rootIndex + 1;  // left child in the heap-style array layout
}

private Integer getRightIdx(Integer rootIndex) {
    return 2 * rootIndex + 2;  // right child in the heap-style array layout
}
I've realized that this can become really inefficient as the tree becomes unbalanced, partly because the array will be sparsely populated, and partly because the tree height will increase, causing searches to tend towards O(n).
I'm looking at ways to rebalance the tree, but I keep coming across algorithms like Day-Stout-Warren which seem to rely on a linked-list implementation for the tree.
Is this just the tradeoff for an array implementation? I can't seem to think of a way to rebalance without creating a second array.
Imagine you have an array of length M that contains N items (with N < M, of course) at various positions, and you want to redistribute them into "valid positions" without changing their order.
To do that you can first walk through the array from end to start, packing all the items together at the end, and then walk through the array from start to end, moving an item into each valid position you find until you run out of items.
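Here's a small C sketch of that easy version (my own illustration, assuming unoccupied slots are marked with 0 and items are non-zero):

void repack(int a[], int m, int n)
{
    int w = m;                        /* write position, one past the end */
    /* pass 1: walk end to start, packing the n items together at the back */
    for (int r = m - 1; r >= 0; --r)
        if (a[r] != 0)
            a[--w] = a[r];
    /* pass 2: walk start to end, moving each item into the next valid slot */
    for (int i = 0; i < n; ++i)
        a[i] = a[w + i];
    /* clear the now-unused tail */
    for (int i = n; i < m; ++i)
        a[i] = 0;
}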
This easy problem is the same as your problem, except that you don't want to walk through the array in "index order", you want to walk through it in binary in-order traversal order.
You want to move all the items into "valid positions", i.e. the part of the array corresponding to indexes < N, and you don't want to change their in-order traversal order.
So, walk the array in reverse in-order order, packing items into the in-order-last possible positions. Then walk forward over the items in order, putting each item into the in-order-first available valid position until you run out of items.
BUT NOTE: This is fun to consider, but it's not going to make your tree efficient for inserts -- you have to do too many rebalancings to keep the array at a reasonable size.
BUT BUT NOTE: You don't actually have to rebalance the whole tree. When there's no free place for the insert, you only have to rebalance the smallest subtree on the path that has an extra space. I vaguely remember a result that I think applies, which suggests that the amortized cost of an insert using this method is O(log^2 N) when your array has a fixed number of extra levels. I'll do the math and figure out the real cost when I have time.
I keep coming across algorithms like Day-Stout-Warren which seem to rely on a linked-list implementation for the tree.
That is not quite correct. The original paper discusses the case where the tree is embedded into an array. In fact, section 3 is devoted to the changes necessary. It shows how to do so with constant auxiliary space.
Note that there's a difference between their implementation and yours, though.
Your idea is to use a binary-heap order, where once you know a single-number index i, you can determine the indices of the children (or the parent). The array is, in general, not sorted in increasing indices.
The idea in the paper is to use an array sorted in increasing indices, and to compact the elements toward the beginning of the array on a rebalance. Using this implementation, you would not specify an element by an index i. Instead, as in binary search, you would indirectly specify an element by a pair (b, e), where the idea is that the index is implicitly specified as ⌊(b + e) / 2⌋, but the information allows you to determine how to go left or right.
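For illustration (my own sketch, not code from the paper), here is what searching a sorted array as such an implicit BST looks like, with a "node" specified by the pair (b, e) and its key at ⌊(b + e) / 2⌋:

/* Returns 1 if key is present in the sorted range a[b..e), else 0. */
int contains(const int a[], int b, int e, int key)
{
    while (b < e) {
        int mid = b + (e - b) / 2;   /* the "root" of the current subtree */
        if (key == a[mid])
            return 1;
        else if (key < a[mid])
            e = mid;                 /* descend into the left subtree */
        else
            b = mid + 1;             /* descend into the right subtree */
    }
    return 0;
}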
Suppose I have a set of sorted doubles.
{ 0.124, 4.567, 12.3 }
A positive, non-zero double is created by another part of the code, and needs to be inserted into this set while keeping it sorted. For example, if the created double is 7.56, the final result is,
{ 0.124, 4.567, 7.56, 12.3 }
In my code, this "create double and insert in sorted set" process is then repeated a great number of times. Possibly 500k to 1 million times. I don't know how many doubles will be created in total exactly, but I know the upper bound.
Attempt
My naive first approach was to create an array with length = upper bound and fill it with zeros, then add the initial set of doubles to it ("add" = replace a 0-valued entry with the double). Whenever a double is created, I add it to the array and do an insertion sort, which I read is efficient for nearly sorted arrays.
Question
I have a feeling running 500k to 1 million insertion sorts will be a serious performance issue. (or am I wrong?) Is there a more efficient data structure and/or algorithm for doing this in C?
Edit:
The reason why I want to keep the set sorted is because after every "create double and insert in sorted set" process, I need to be able to look up the smallest element in that set (and possibly remove it by replacing it with a 0). I thought the best way to do this would be to keep the set sorted.
But if that is not the case, perhaps there is an alternative?
Since all you want to do is pull out the minimum element in every iteration, use a min-heap instead. Depending on the variant, heaps give you O(log n) or even amortized O(1) insertion, O(1) find-min, and cheap decrease-key operations (though note that removing the minimum element always takes O(log n) time). For what you are doing, a heap will be substantially faster.
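A minimal binary min-heap sketch in C (my own illustration; a plain binary heap has O(log n) insert and extract-min and O(1) find-min, which is plenty here). It assumes the backing array was allocated with enough capacity up front:

#include <stddef.h>

typedef struct {
    double *data;    /* backing array, capacity known up front */
    size_t  size;
} MinHeap;

void heap_insert(MinHeap *h, double x)
{
    size_t i = h->size++;
    h->data[i] = x;
    while (i > 0 && h->data[(i - 1) / 2] > h->data[i]) {   /* sift up */
        double tmp = h->data[i];
        h->data[i] = h->data[(i - 1) / 2];
        h->data[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

double heap_min(const MinHeap *h)        /* find-min: O(1) */
{
    return h->data[0];
}

double heap_extract_min(MinHeap *h)      /* remove-min: O(log n) */
{
    double min = h->data[0];
    h->data[0] = h->data[--h->size];
    size_t i = 0;
    for (;;) {                           /* sift down */
        size_t l = 2 * i + 1, r = 2 * i + 2, s = i;
        if (l < h->size && h->data[l] < h->data[s]) s = l;
        if (r < h->size && h->data[r] < h->data[s]) s = r;
        if (s == i) break;
        double tmp = h->data[i]; h->data[i] = h->data[s]; h->data[s] = tmp;
        i = s;
    }
    return min;
}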
Rather than running an insertion sort, you could use binary search to find the insertion point, and then insert the value there. But this is slow, because you may need to shift a lot of data many times (think what happens if the data comes in sorted in reverse of what you need; the timing would be O(N^2)).
The fastest approach is to insert first, and then sort everything at once. If this is not possible, consider replacing your array with a self-balancing ordered tree structure, such as an RB-Tree.
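For reference, a quick C sketch of the binary-search insert described above (my own illustration, assuming a plain array a with *n current elements and spare capacity):

#include <string.h>

void insert_sorted(double a[], size_t *n, double x)
{
    size_t lo = 0, hi = *n;
    while (lo < hi) {                  /* binary search for the slot */
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < x)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* shift the tail one slot to the right, then drop x in */
    memmove(&a[lo + 1], &a[lo], (*n - lo) * sizeof a[0]);
    a[lo] = x;
    ++*n;
}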
Let's say that I am streaming non-empty strings (char[]/char*s) into my program. I would like to create a set of them. That is, for any element a in set S, a is unique in S.
I have thought to approach this in a few ways, but have run into issues.
If I knew the number of items n I would be reading, I could just create a hash table of that size, with all elements beginning as null, and not insert an element if there was a collision. When the insertions are done, I would iterate through the array of the hash table, counting the non-null values to get the size, then create an array of that size and copy all the values to it.
I could just use a single array and resize it before an element is added, using a search algorithm to check whether the element already exists before resizing/adding it.
I realize the second method would work, but because the elements may not be sorted, it could also take a very long time for large inputs, because of the choice of search algorithm and the repeated resizing.
Any input would be appreciated. Please feel free to ask questions in the comment box below if you need further information. Libraries would be very helpful! (Google searching "Sets in C" and similar things doesn't help very much.)
A hash table can work even if you don't know the number of elements that you are going to be inserting ... you would simply define your hash table to use "buckets" (i.e., each position is actually a linked list of elements that hash to the same value), and you would search through each "bucket" to make sure that each element has not already been inserted into the hash table. The key to avoiding large "buckets" to search through is a good hash algorithm.
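A rough C sketch of that bucketed approach for strings (my own illustration; the fixed bucket count, hash function, and names are all assumptions):

#include <stdlib.h>
#include <string.h>

#define NBUCKETS 1024

typedef struct Node {
    char *str;
    struct Node *next;
} Node;

static Node *buckets[NBUCKETS];

static unsigned long hash_str(const char *s)    /* djb2-style string hash */
{
    unsigned long h = 5381;
    while (*s)
        h = h * 33 + (unsigned char)*s++;
    return h;
}

/* Returns 1 if the string was inserted, 0 if it was already present. */
int set_insert(const char *s)
{
    unsigned long b = hash_str(s) % NBUCKETS;
    for (Node *n = buckets[b]; n; n = n->next)  /* scan the bucket */
        if (strcmp(n->str, s) == 0)
            return 0;                           /* duplicate: skip it */
    Node *n = malloc(sizeof *n);
    n->str = malloc(strlen(s) + 1);
    strcpy(n->str, s);
    n->next = buckets[b];
    buckets[b] = n;
    return 1;
}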
You can also, if you can define a weak ordering of your objects, use a binary search tree. Then if !(A < B) and !(B < A), it can be assumed A == B, and you would therefore not insert any additional iterations of that object into the tree, which again would define a set.
While I know you're using C, consider the fact that in the C++ STL, std::set uses a RB-tree (red-black tree which is a balanced binary search tree), and std::unordered_set uses a hash-table.
Using an array is a bad idea ... resizing operations will take a long time, whereas insertions into a tree can be done in O(log N) time, and into a hash table in amortized O(1).
I am currently reading Cormen's "Introduction to Algorithms" and I found something called a sentinel.
It's used in the mergesort algorithm as a tool to decide when one of the two merging lists is exhausted. Cormen uses the infinity symbol for the sentinels in his pseudocode and I would like to know how such an infinite value can be implemented in C.
A sentinel is just a dummy value. For strings, you might use a NULL pointer, since that's not a sensible thing to have in a list. For integers, you might use a value unlikely to occur in your data set, e.g. if you are dealing with a list of ages, you can use the age -1 to mark the end of the list.
You can get an "infinite value" for floats, but it's not the best idea. For arrays, pass the size explicitly; for lists, use a null pointer sentinel.
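For completeness (my note, assuming C99): floating-point types do have a genuine infinity available if you want to go that route:

#include <math.h>

double sentinel = INFINITY;   /* compares greater than every finite double */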
In C, when sorting an array, you usually know the size, so you could actually sort a range [begin, end) in which end is one past the end of the array. E.g. int a[n] could be sorted as sort(a, a + n).
This allows you to do two things:
call your sort recursively with the part of the array you haven't sorted yet (merge sort is a recursive algorithm)
use end as a sentinel.
If you know the elements in your list will range from the smallest to the highest possible values for the given data type, the code you are looking at won't work; you'll have to come up with something else, which I am sure can be done. I have that book in front of me right now, and I have a solution that will work for you if you know the values range from the smallest value of the data type to the largest minus one at most.

Open the book back up to page 31 and take a look at the Merge function. The lines causing you trouble are lines 8 and 9, where the sentinel value of infinity is used. We know the two sub-arrays are each already sorted and that we just need to merge them to get the array that is twice as big, in sorted order. This means the largest element in each half is at the end of its sub-array, and the larger of those two is the largest element of the merged array we will have once the merge completes. All we need to do is determine the larger of those two values, increment it by one, and use that as our sentinel. So, lines 8 and 9 of the code should be replaced by the following 6 lines of code:
if L[n1] < R[n2]
largest = R[n2]
else
largest = L[n1]
L[n1 + 1] = largest + 1
R[n2 + 1] = largest + 1
That should work for you. I have a test tomorrow in my algorithms course on this stuff and I came across your post here and thought I'd help you out. The authors' use of sentinels in this book is something that has always bugged me, and I absolutely can not stand how much they are in love with recursion. Iteration is faster and in my opinion usually easier to come up with and grasp.
The trick is that you don't have to check array bounds when incrementing the index in only one of the lists in the inner while loops. Hence you need sentinels that are larger than all other elements. In C++ I usually use std::numeric_limits<TYPE>::max().
The C-equivalent should be macros like INT_MAX, UINT_MAX, LONG_MAX etc. Those are good sentinels. If you need two different sentinels, use ..._MAX and ..._MAX - 1
This is all assuming you're merging two lists that are ordered ascending.
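For illustration, a C sketch of the CLRS-style merge with INT_MAX as the sentinel (my own code; it assumes no real element equals INT_MAX and uses C99 variable-length arrays for brevity):

#include <limits.h>
#include <string.h>

void merge(int a[], int p, int q, int r)   /* merge a[p..q] and a[q+1..r] */
{
    int n1 = q - p + 1, n2 = r - q;
    int L[n1 + 1], R[n2 + 1];
    memcpy(L, &a[p], n1 * sizeof(int));
    memcpy(R, &a[q + 1], n2 * sizeof(int));
    L[n1] = INT_MAX;                       /* sentinels: larger than any */
    R[n2] = INT_MAX;                       /* real element, never copied */
    for (int i = 0, j = 0, k = p; k <= r; ++k)
        a[k] = (L[i] <= R[j]) ? L[i++] : R[j++];
}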
Given two lists l1, l2, show how to merge them in O(1) time. The data structure for the lists depends on how you design it. By merging I mean the union of the lists.
Eg: List1 = {1,2,3}
List2 = {2,4,5}
Merged list = {1,2,3,4,5}
On "merging" two sorted lists
It is straightforwardly impossible to merge two sorted lists into one sorted list in O(1).
About the closest thing you can do is have a "lazy" merge that can extract on-demand each successive element in O(1), but to perform the full merge, it's still O(N) (where N is the number of elements).
Another close thing you can do is physically join two lists end to end into one list, performing no merge algorithm whatsoever, such that all elements from one list come before all elements from the other list. This can in fact be done in O(1) (if the list maintains head and tail pointers), but it is not a merge by the traditional definition.
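A small C sketch of that O(1) physical join (my own illustration, assuming a singly linked list that keeps both head and tail pointers):

typedef struct Node { int value; struct Node *next; } Node;
typedef struct { Node *head, *tail; } List;

void concat(List *dst, List *src)    /* append src onto dst in O(1) */
{
    if (!src->head)
        return;                      /* nothing to append */
    if (dst->tail)
        dst->tail->next = src->head;
    else
        dst->head = src->head;       /* dst was empty */
    dst->tail = src->tail;
    src->head = src->tail = NULL;    /* src is now empty */
}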
On set union in O(1)
If the question is about what kind of set representation allows a union operation in O(1), then yes, this can in fact be done. There are many set representations possible in practice, each with its pluses and minuses.
An essential example of this specialized representation is the disjoint-set data structure, which permits O(α(n)) amortized time for the elementary Union and Find operations. α is the inverse of the Ackermann function; whereas the Ackermann function grows extremely quickly, its inverse α grows extremely slowly. The disjoint-set data structure essentially offers amortized O(1) operations for any practical size.
Note that "disjoint" is key here: it cannot represent the two sets {1, 2, 3} and {2, 4, 5}, because those sets are not disjoint. The disjoint-set data structure represents many sets (not just two), but no two distinct sets are allowed to have an element in common.
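A short C sketch of a disjoint-set forest with path compression and union by rank (my own illustration; the fixed capacity and names are assumptions):

#define MAX_ELEMS 1024

static int parent[MAX_ELEMS];
static int rnk[MAX_ELEMS];

void ds_init(int n)
{
    for (int i = 0; i < n; ++i) {
        parent[i] = i;                    /* each element is its own set */
        rnk[i] = 0;
    }
}

int ds_find(int x)
{
    if (parent[x] != x)
        parent[x] = ds_find(parent[x]);   /* path compression */
    return parent[x];
}

void ds_union(int a, int b)
{
    int ra = ds_find(a), rb = ds_find(b);
    if (ra == rb)
        return;                           /* already in the same set */
    if (rnk[ra] < rnk[rb]) { int t = ra; ra = rb; rb = t; }
    parent[rb] = ra;                      /* attach the shallower tree */
    if (rnk[ra] == rnk[rb])
        rnk[ra]++;
}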
Another highly practical set representation is a bit array, where elements are mapped to bit indices, and a 0/1 indicates absence/presence respectively. With this representation, a union is simply a bitwise OR; intersection a bitwise AND. Asymptotically this isn't the best representation, but it can be a highly performant set representation in practice.
What you are looking for is not an algorithm to merge two "lists" in O(1) time. If you read "lists" as "linked lists" then this cannot be done faster than O(n).
What you are being asked to find is a data structure to store this data in that supports merging in O(1) time. This data structure will not be a list. It is still in general impossible to merge in "hard" O(1) time, but there are data structures that support merging in amortized O(1) time. Perhaps the most well-known example is the Fibonacci Heap.
I am not that experienced, so please don't hit me hard if I say something stupid.
Will this work? Since you have two linked lists, how about connecting the first element of the second list to the last element of the first list? We are still talking about pointers, right? The pointer of the last element of the first list would then point to the first element of the second list.
Does this work?
Edit: but we are looking for the union, so I guess it won't...
You make use of the fact that a PC is a finite state machine with 2^(bits of memory/storage space) states and thereby declare everything O(1).