I have four sorted lists and I want to merge them into one sorted list.
What is the most efficient way to do that? If the implementation can be done in parallel, it's a plus.
This is the merge part of merge sort.
Just take the minimum of the four elements at the head of each list and append it to the output list. Repeat until all lists are empty. Assuming that taking the minimum of four elements (min4) is a fixed cost, this is just going to be O(N).
If you have more information (such as the range of the list) you can probably improve things a little, but I don't think these affect the asymptotic complexity.
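The approach above can be sketched in C. This is a minimal version assuming `int` elements and arrays as the "lists"; for a fixed number of lists (here at most 4), scanning the heads each round is a constant cost per output element, so the whole merge is O(N).

```c
#include <stddef.h>

/* Merge up to 4 sorted int arrays by repeatedly taking the smallest
 * head element. With k fixed, each output element costs O(k), so the
 * whole merge is O(k * N) = O(N) total. Returns the output length. */
size_t merge_k(const int *lists[], const size_t lens[], size_t k, int *out)
{
    size_t idx[4] = {0};          /* read cursor into each list (k <= 4) */
    size_t produced = 0;

    for (;;) {
        int best = -1;            /* index of the list with the smallest head */
        for (size_t i = 0; i < k; i++) {
            if (idx[i] < lens[i] &&
                (best < 0 || lists[i][idx[i]] < lists[best][idx[best]]))
                best = (int)i;
        }
        if (best < 0)             /* all lists exhausted */
            break;
        out[produced++] = lists[best][idx[best]++];
    }
    return produced;
}
```

For parallelism, one option is to merge pairs of lists concurrently (1+2 and 3+4), then merge the two results.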
Related
This might be a silly question; for all of you who have been in this field for a while, I would nevertheless appreciate your insight on the matter: why does an array need to be sorted, and in what scenarios would we need to sort an array?
So far it is clear to me that the whole purpose of sorting is to organize the data in a way that minimizes searching complexity and improves the overall efficiency of our program. Still, I would appreciate it if someone could describe a scenario in which sorting an array would be most useful. If we are searching for something specific, like a number, wouldn't the process of sorting the array be just as demanding as simply iterating through the array until we find what we are looking for? I hope that made sense.
Thanks.
This is just a general question for my coursework.
A lot of different algorithms work much faster on a sorted array, including searching, comparing, and merging arrays.
For a one-time operation, you're right: it is easier and faster to use an unsorted array. But as soon as you need to repeat the operation multiple times on the same array, it is much faster to sort the array once and then take advantage of the order.
Even if you are going to modify the array, you can keep it sorted; again, this improves the performance of all the other operations.
Sorting brings useful structure to a list of values.
In raw data, reading a value tells you absolutely nothing about the other values in the list. In a sorted list, when you read a value, you know that all preceding elements are not larger, and following elements are not smaller.
So to search a raw list, you have no other choice than exhaustive comparison, while when searching a sorted list, comparing to the middle element tells you in which half the searched value can be found and this drastically reduces the number of tests to be performed.
When the list is given in sorted order, you can benefit from this. When it is given in no known order, you have to ponder if it is worth affording the cost of the sort to accelerate the searches.
Sorting has other algorithmic uses than search in a list, but it is always the ordering property which is exploited.
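The halving argument above is exactly binary search. As a sketch in C, assuming a sorted `int` array: each comparison with the middle element discards half of the remaining range, so a lookup takes O(log n) comparisons instead of the O(n) an exhaustive scan of a raw list requires.

```c
#include <stddef.h>

/* Binary search on a sorted int array. Returns the index of `key`,
 * or -1 if it is absent. */
long binary_search(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;        /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return (long)mid;
        if (a[mid] < key)
            lo = mid + 1;         /* key can only be in the upper half */
        else
            hi = mid;             /* key can only be in the lower half */
    }
    return -1;
}
```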
I have a huge amount of data (mainly of type long long) which is mostly sorted: the data is spread across different files, and in each file the data is in sorted order. I need to dump this data into a single file in sorted order. Which data structure should I use? I am thinking about a BST.
Is there any other DS I should use that can give me optimum performance?
Thanks
Arpit
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, extract the data into a simple array and then use insertion sort.
Insertion sort runs in O(n) on mostly presorted data.
However, this depends on whether you can hold a large enough array in memory, which depends on your input size.
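A minimal insertion sort sketch in C, using `long long` to match the question's data type: each element is shifted left past any larger neighbours until it is in place, so on mostly presorted data the inner loop barely runs and the total work is close to O(n).

```c
#include <stddef.h>

/* Insertion sort over long long values. Near-linear on nearly sorted
 * input; O(n^2) only in the fully unsorted worst case. */
void insertion_sort(long long *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        long long key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {   /* shift larger values right */
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}
```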
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are out of their precise sorted positions.
However, since you state that the data is in different files where each file is individually sorted, it may be a good candidate for the Merge subroutine of merge sort.
Note that the Merge routine merges two already-sorted arrays. If you have, say, 10 files, each of them individually sorted for sure, then using the Merge routine would take only O(n).
However, if there are even a few instances where a single file is not perfectly sorted (on its own), you need to use insertion sort.
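The Merge routine mentioned above can be sketched in C, again assuming `long long` elements held in arrays: it combines two individually sorted inputs into one sorted output in a single O(n) pass.

```c
#include <stddef.h>

/* The Merge step of merge sort: combine two sorted arrays `a` and `b`
 * into `out` in one linear pass. `out` must hold na + nb elements. */
void merge2(const long long *a, size_t na,
            const long long *b, size_t nb, long long *out)
{
    size_t i = 0, j = 0, k = 0;
    while (i < na && j < nb)
        out[k++] = (a[i] <= b[j]) ? a[i++] : b[j++];
    while (i < na) out[k++] = a[i++];   /* drain whichever input is left */
    while (j < nb) out[k++] = b[j++];
}
```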
Update 2:
The OP says he cannot use an array because he cannot know the number of records in advance. A simple linked list is out of the question, since it never competes with arrays (sequential vs. random access time) in practice.
As pointed out in the comments, a linked list is a good idea IF the files are individually sorted and all you need to run on them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the C++ tag was used (only removed later), going for a vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be heap sort, since it builds a heap first (heapify), so it can dynamically accommodate as many elements as needed, and still produces O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
Linked list is the best choice for the data structure if you are sorting it using insertion sort.
Reason for using linked list:
Removing and inserting elements can be done faster when elements are stored as a linked list.
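As a sketch in C of the claim above, here is insertion into a sorted singly linked list: a linear walk finds the spot, then the node is spliced in with two pointer writes, so no elements have to be shifted. (Error handling for `malloc` is omitted for brevity.)

```c
#include <stdlib.h>

struct node {
    long long value;
    struct node *next;
};

/* Insert `value` into the sorted list starting at `head`; returns the
 * (possibly new) head. Splicing costs O(1) once the spot is found. */
struct node *sorted_insert(struct node *head, long long value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    if (!head || value <= head->value) {  /* new smallest element */
        n->next = head;
        return n;
    }
    struct node *cur = head;
    while (cur->next && cur->next->value < value)
        cur = cur->next;                  /* walk to the insertion point */
    n->next = cur->next;
    cur->next = n;
    return head;
}
```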
I need to add an unknown number of times, using pthreads, to a data structure and order them earliest first. Can anybody recommend a good structure (linked list / array list) for this?
A linked list will be O(n) in finding the place where the new object is to go, but constant in inserting it.
A dynamic array/array list will be O(log(n)) finding the right place (via binary search) but worst case O(n) insertion, since you'll need to shift all values past the insertion point over by one.
If you don't need random access, or at least not until the end, you could use a heap, O(log(n)) insertion, after you're done you can pull them out in O(log(n)) each, so O(n*log(n)) for all of them.
And it's possible there's a (probably tree-based) structure that can do all of it in O(log(n)) (red-black tree?).
So, in the end it boils down to how, precisely, you want to use it.
Edit: Looked up red-black trees and it looks like they are O(log(n)) search ("amortized O(1)", according to Wikipedia), insertion, and deletion, so that may be what you want.
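The heap option mentioned above can be sketched in C. This is a minimal array-backed binary min-heap with a fixed capacity (an assumption for the sketch): push sifts the new value up, pop removes the smallest and sifts the last value down, both in O(log(n)).

```c
#include <stddef.h>

struct heap {
    long a[1024];     /* fixed capacity, enough for the sketch */
    size_t n;
};

/* Insert `v`, restoring the heap property by sifting up: O(log n). */
void heap_push(struct heap *h, long v)
{
    size_t i = h->n++;
    while (i > 0 && h->a[(i - 1) / 2] > v) {   /* parent larger: move it down */
        h->a[i] = h->a[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    h->a[i] = v;
}

/* Remove and return the smallest value, sifting down: O(log n). */
long heap_pop(struct heap *h)
{
    long min = h->a[0];
    long v = h->a[--h->n];
    size_t i = 0;
    for (;;) {
        size_t c = 2 * i + 1;                   /* left child */
        if (c >= h->n) break;
        if (c + 1 < h->n && h->a[c + 1] < h->a[c])
            c++;                                /* pick the smaller child */
        if (h->a[c] >= v) break;
        h->a[i] = h->a[c];
        i = c;
    }
    h->a[i] = v;
    return min;
}
```

Popping everything at the end yields the values earliest-first, for O(n log(n)) total.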
If you just need the order at the end, use a linked list to store the items, maintaining a count of records added. Then create an array of size count, copy the elements into the newly created array, and delete them from the list.
Finally, sort the array using qsort.
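The qsort step looks like this in C, using `long` as a stand-in for whatever time type is actually stored: qsort takes a comparator returning negative/zero/positive and runs in O(n log(n)).

```c
#include <stdlib.h>

/* Comparator for long values; (a > b) - (a < b) avoids the overflow
 * that a plain `a - b` could produce. */
static int cmp_long(const void *pa, const void *pb)
{
    long a = *(const long *)pa, b = *(const long *)pb;
    return (a > b) - (a < b);
}

/* Sort the collected times in ascending (earliest-first) order. */
void sort_times(long *times, size_t n)
{
    qsort(times, n, sizeof *times, cmp_long);
}
```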
If you need to maintain the list in order as items arrive, use a heap.
The former approach has the following complexity:
O(n) for insertion
O(n log(n)) for sorting
The latter approach has:
O(n log(n)) for insertion and fetching
You can also look at a priority queue.
Please note that if you are open to using the STL, you can go for std::priority_queue.
In terms of memory, the latter would consume more because you have to store two pointers per node.
I'm scanning a large data source, currently about 8 million entries, extracting one string per entry, which I want in alphabetical order.
Currently I put them in an array, then sort an index to them using qsort(), which works fine.
But out of curiosity I'm thinking of instead inserting each string into a data structure that maintains them in alphabetical order as I scan them from the data source, partly for the experience of implementing one, partly because it will feel faster without the wait for the sort to complete after the scan has finished (-:
What data structure would be the most straightforward to implement in C?
UPDATE
To clarify, the only operations I need to perform are inserting an item and dumping the index when it's done, by which I mean for each item in the original order dump an integer representing the order it is in after sorting.
SUMMARY
The easiest to implement are binary search trees.
Self balancing binary trees are much better but nontrivial to implement.
Insertion can be done iteratively but in-order traversal for dumping the results and post-order traversal for deleting the tree when done both require either recursion or an explicit stack.
Without implementing balancing, runs of ordered input will result in the degenerate worst case which is a linked list. This means deep trees which severely impact the speed of the insert operation.
Shuffling the input slightly can break up ordered input significantly and is easier to implement than balancing.
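The summary above can be sketched in C: a minimal unbalanced BST over strings, with an iterative insert and a recursive in-order traversal to dump the sorted order. (As noted, with already-ordered input this degenerates into a linked list; `malloc` error handling is omitted.)

```c
#include <stdlib.h>
#include <string.h>

struct bst {
    char *key;
    struct bst *left, *right;
};

/* Iterative insert: walk the child pointers until an empty slot. */
void bst_insert(struct bst **root, const char *key)
{
    while (*root) {
        int c = strcmp(key, (*root)->key);
        root = (c < 0) ? &(*root)->left : &(*root)->right;
    }
    struct bst *n = malloc(sizeof *n);
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->left = n->right = NULL;
    *root = n;
}

/* Recursive in-order traversal: appends keys to out[] in ascending
 * order starting at position `pos`; returns the next free position. */
size_t bst_inorder(const struct bst *t, const char **out, size_t pos)
{
    if (!t) return pos;
    pos = bst_inorder(t->left, out, pos);
    out[pos++] = t->key;
    return bst_inorder(t->right, out, pos);
}
```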
Binary search trees. Or self-balancing search trees. But don't expect those to be faster than a properly implemented dynamic array, since arrays have much better locality of reference than pointer structures. Also, unbalanced BSTs may "go linear", so your entire algorithm becomes O(n²), just like quicksort.
You are already using the optimal approach. Sorting at the end will be much cheaper than maintaining an online sorted data structure. You can get the same O(log N) with a red-black tree, but the constant factor will be much worse, not to mention the significant space overhead.
That said, AVL trees and red-black trees are much simpler to implement if you don't need to support deletion. A left-leaning red-black tree can fit in 50 or so lines of code. See http://www.cs.princeton.edu/~rs/talks/LLRB/ (by Sedgewick)
You could implement a faster sorting algorithm such as Timsort, or another algorithm with an O(n log n) worst case, and then search it using binary search, since that is faster once the list is sorted.
You should take a look at the Trie data structure (wikilink);
I think this will serve what you want.
I am on a system with only about 512 KB available to my application (the rest is used for buffers). I need to be as efficient as possible.
I have about 100 items that are rapidly added to and deleted from a list. What is an efficient way to store these in C, and is there a library (with a good license) that will help? The list never grows above 256 items, and its average size is 15 items.
Should I use a Binary Search Tree?
Red Black Tree
With an average size of 15, all these other solutions are unnecessary overkill; a simple dynamic array is best here. Searching is a linear pass over the array and insertion and deletion requires moving all elements behind the insertion point. But still this moving around will be offset by the lack of overhead for so few elements.
Even better, since you’re doing a linear search anyway, deleting at arbitrary points can be done by swapping the last element to the deleted position so no further moving around of elements is required – yielding O(1) for insertion and deletion and O(very small n) for lookup.
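A sketch of that swap-delete trick in C, using a fixed 256-slot array to match the stated upper bound (the element type here is `int` as an assumption): insertion appends at the end, lookup is a linear scan, and deletion moves the last element into the freed slot, making both insert and delete O(1).

```c
#include <stddef.h>

struct small_list {
    int items[256];   /* list never grows above 256 items */
    size_t len;
};

/* Append at the end: O(1). Returns -1 if the list is full. */
int sl_insert(struct small_list *l, int v)
{
    if (l->len >= 256) return -1;
    l->items[l->len++] = v;
    return 0;
}

/* Linear scan: O(very small n). Returns the index or -1. */
long sl_find(const struct small_list *l, int v)
{
    for (size_t i = 0; i < l->len; i++)
        if (l->items[i] == v) return (long)i;
    return -1;
}

/* Swap-delete: move the last element into slot i, shrink by one. O(1),
 * but note it does not preserve insertion order. */
void sl_delete_at(struct small_list *l, size_t i)
{
    l->items[i] = l->items[--l->len];
}
```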
If your list never grows longer than 256, the best option may be a hash table: add and remove each element via a hash function. That way each add/remove takes only O(1), and the amount of memory used doesn't need to be large.
I would use a doubly linked list. When dealing with tens or hundreds of items, it's not terribly slower to search than an array, and it has the advantage of taking up only as much space as it absolutely needs. Adding and removing elements is very simple and incurs very little additional overhead.
A tree structure is faster for searching, but has more overhead when adding or removing elements. That said, when dealing with tens or hundreds of items, the difference probably isn't significant. If I were you, I'd build an implementation of each and see which one is faster in actual usage.
With 15 items, a BST should be fine if you can keep it sorted; I'm not sure the overhead will be much better than a linked list or an array if the items are rather small. For a lot of insertions/deletions I recommend a linked list, because the only thing you have to do is patch pointers.
What's wrong with a plain old array? You said "list" so presumably order is important, so you can't use a hash set (if you do use a hash set, use probing, not chaining).
You can't use a linked list because it would double your memory requirements. A tree would have the same problem, and it would be much more difficult to implement.