I need to add an unknown number of times (using pthreads) to a data structure and order them earliest first. Can anybody recommend a good structure (linked list / array list) for this?
A linked list will be O(n) in finding the place where the new object is to go, but constant in inserting it.
A dynamic array/array list will be O(log(n)) to find the right place (via binary search), but worst case O(n) to insert, since you'll need to move every value past the insertion point over by one.
If you don't need random access, or at least not until the end, you could use a heap, O(log(n)) insertion, after you're done you can pull them out in O(log(n)) each, so O(n*log(n)) for all of them.
And it's possible there's a (probably tree-based) structure that can do all of it in O(log(n)) (red-black tree?).
So, in the end it boils down to how, precisely, you want to use it.
Edit: Looked up red-black trees and it looks like they are O(log(n)) for search, insertion, and deletion (with "amortized O(1)" rebalancing per update, according to Wikipedia), so that may be what you want.
If you just need to order at the end, use a linked list to store the pthreads while maintaining a count of records added. Then create an array of size count, copy the elements into the newly created array, and delete them from the list.
Finally, sort the array using qsort.
If you need to maintain an ordered list of pthreads, use a heap.
The former approach would have the following complexity:
O(n) for insertion
O(n log(n)) for sorting
The latter approach would have:
O(n log(n)) for insertion and fetching
You can also look at a priority queue.
Please note that if you are open to using the STL, you can go for std::priority_queue.
In terms of memory, the latter approach would consume more, because you have to store two pointers per node.
I am trying to figure out the time efficiency of mergesort on a linked list versus an array of pointers (Not worrying about how I am going to use it in the future, solely the speed at which the data get sorted).
Which would be faster? I imagine using an array of pointers requires an additional layer of memory access.
But at the same time, accessing a linked list would be slower. Even assuming we go in already knowing the linked list's length, mergesort would still require iterating through the list, jumping from memory location to memory location, until you get a pointer to the middle node, which I think takes more time than with an array.
Does anyone have any insights? Is it more contextual to the data being sorted?
The primary difference between implementing merge sort on a linked list versus an array of pointers is that with the array you end up having to use a secondary array. The algorithmic complexity is the same, O(n * log(n)), but the array version uses O(n) extra memory. You don't need that extra memory in the linked-list case.
In real world implementation, runtime performance of the two should differ by a constant factor, but not enough to favor one over the other. That is, if you have an array of pointers, you probably won't benefit from turning it into a linked list, sorting, and converting it back to an array. Nor would you, given a linked list, benefit from creating an array, sorting it, and then building a new array.
I would like to implement a data structure which can do fast insertion and keep the data sorted, without duplicates, after every insert.
I thought about a binomial heap, but as I understand that structure, it can't tell during insertion whether a particular element is already in the heap. On the other hand there is the AVL tree, which fits my case perfectly, but honestly it is rather too hard for me to implement at the moment.
So my question is: is there any possibility to modify the binomial heap insertion algorithm to skip duplicates? Or maybe someone could suggest another structure?
Greetings :)
In C++, there is std::set. It is internally an implementation of a red-black tree, so it keeps the data sorted as you insert it. You can have a look at that for a reference.
A good data structure for this is the red-black tree, which is O(log(n)) for insertion. You said you would like to implement a data structure that does this. A good explanation of how to implement that is given here, as well as an open source usable library.
If you're okay using a library, you may take a look at libavl (linked here).
The library implements some other varieties of binary trees as well.
Skip lists are also a possibility if you are concerned with thread safety. Balanced binary search trees will perform more poorly than a skip list in such a case as skip lists require no rebalancing, and skip lists are also inherently sorted like a BST. There is a disadvantage in the amount of memory required (since multiple linked lists are technically used), but theoretically speaking it's a good fit.
You can read more about skip lists in this tutorial.
If you have a truly large number of elements, you might also consider just using a doubly-linked list and sorting the list after all items are inserted. This has the benefit of ease of implementation and insertion time.
You would then need to implement a sorting algorithm. A selection sort or insertion sort would be slower but easier to implement than a mergesort, heapsort, or quicksort algorithm. On the other hand, the latter three are not terribly difficult to implement either. The only thing to be careful about is that you don't overflow the stack since those algorithms are typically implemented using recursion. You could create your own stack implementation (not difficult) and implement them iteratively, pushing and popping values onto your stack as necessary. See Iterative quicksort for an example of what I'm referring to.
If you're looking for fast insertion and easy implementation, why not a linked list (singly or doubly linked)?
insertion: push head / push tail - O(1)
removal: pop head / pop tail - O(1)
The only BUT is that "find" will be O(n).
I have a huge amount of data (mainly of type long long) which is mostly sorted (the data is spread across different files, and in each file the data is in sorted format). I need to dump this data into a file in a sorted manner. Which data structure should I use? I am thinking about a BST.
Is there any other DS I should use which can give me the optimum performance ?
Thanks
Arpit
Using any additional data structure won't help. Since most of your data is already sorted and you just need to fix the occasional value, use a simple array to hold the data, then use insertion sort.
Insertion sort runs in O(n) on mostly presorted data.
However, this depends on whether you can hold a large enough array in memory, given your input size.
Update:
I wasn't very clear on your definition of "mostly sorted". Generally it means only a few elements are not in their precise sorted positions.
However, as you stated further, the data is in different files where each file is individually sorted, so it may be a good candidate for the merge subroutine used in merge sort.
Note that the merge routine merges two already-sorted arrays. If you have, say, 10 files, each of them individually sorted for sure, then using the merge routine would take only O(n).
However, if there are even a few instances where a single file is not perfectly sorted (on its own), you need to use insertion sort.
Update 2:
The OP says he cannot use an array because he cannot know the number of records in advance. A simple linked list is out of the question, since it never competes with arrays (sequential vs. random access time) in time complexity.
As pointed out in the comments, using a linked list is a good idea IF the files are individually sorted and all you need to run on them is the merge procedure.
Dynamically allocated arrays are best if he can predict the size at some point. Since the c++ tag was used (only removed later), going for a vector would be a good idea, since it can resize comfortably.
Otherwise, one option might be heap sort, since it would build a heap first (which can dynamically accommodate as many elements as needed) and still produce O(n log n) complexity. This is still better than trying to use a linked list.
Perhaps you don't need a data structure at all.
If the files are already sorted, you can use the merge part of merge sort, which is O(n), or more generally O(n*log k), where k is the number of files.
How many files do you have to merge?
If it's only a few (on the order of a dozen or so) and each individual file is fully sorted, then you shouldn't need to build any sort of complex data structure at all: just open all the input files, read the next record from each file, compare, write the smallest to the destination, then replace that record from the appropriate file.
If each file is not fully sorted or if there are too many files to open at once, then yes, you will need to build an intermediate data structure in memory. I'd recommend a self-balancing tree, but since the data are already mostly sorted, you'll be re-balancing on almost every insert. A heap may work better for your purposes.
Best Sorting Algorithm:
Insertion sort can be used efficiently for nearly sorted data (O(n) time complexity).
Best data structure:
Linked list is the best choice for the data structure if you are sorting it using insertion sort.
Reason for using linked list:
Removing and inserting elements can be done faster when elements are stored as a linked list.
I'm scanning a large data source, currently about 8 million entries, extracting one string per entry, which I want in alphabetical order.
Currently I put them in an array, then sort an index to them using qsort(), which works fine.
But out of curiosity I'm thinking of instead inserting each string into a data structure that maintains them in alphabetical order as I scan them from the data source, partly for the experience of implementing one, partly because it will feel faster without the wait for the sort to complete after the scan has finished (-:
What data structure would be the most straightforward to implement in C?
UPDATE
To clarify, the only operations I need to perform are inserting an item and dumping the index when it's done, by which I mean for each item in the original order dump an integer representing the order it is in after sorting.
SUMMARY
The easiest to implement are binary search trees.
Self balancing binary trees are much better but nontrivial to implement.
Insertion can be done iteratively but in-order traversal for dumping the results and post-order traversal for deleting the tree when done both require either recursion or an explicit stack.
Without implementing balancing, runs of ordered input will result in the degenerate worst case which is a linked list. This means deep trees which severely impact the speed of the insert operation.
Shuffling the input slightly can break up ordered input significantly, and is easier to implement than balancing.
Binary search trees. Or self-balancing search trees. But don't expect those to be faster than a properly implemented dynamic array, since arrays have much better locality of reference than pointer structures. Also, unbalanced BSTs may "go linear", so your entire algorithm becomes O(n²), just like quicksort.
You are already using the optimal approach. Sort at the end will be much cheaper than maintaining an online sorted data structure. You can get the same O(logN) with a rb-tree but the constant will be much worse, not to mention significant space overhead.
That said, AVL trees and rb-trees are much simpler to implement if you don't need to support deletion. Left-leaning rb tree can fit in 50 or so lines of code. See http://www.cs.princeton.edu/~rs/talks/LLRB/ (by Sedgewick)
You could use a faster sorting algorithm such as Timsort, or another sorting algorithm with an O(n log(n)) worst case, and then just search using binary search, since that's faster once the list is sorted.
You should take a look at the trie data structure (see the Wikipedia link).
I think this will serve what you want.
Welcome, my friend,
In some homework of mine, I feel the need to use the Graph ADT. However, I'd like to have it, how do I say, generic. That is to say, I want to store in it whatever I fancy.
The issue I'm facing has to do with complexity. What data structure should I use to represent the set of nodes? I forgot to say that I've already decided to use the adjacency-list technique.
Generally, textbooks mention a linked list, but, it is to my understanding that whenever a linked list is useful and we need to perform searches, a tree is better.
But then again, what we need is to associate a node with its list of adjacent nodes, so what about a hash table?
Can you help me decide in which data structure (linked list, tree, hash table) should i store the nodes?
...the Graph ADT. However, I'd like to have it, how do I say, generic. That is to say, I want to store in it whatever I fancy.
That's basically the point of an ADT (Abstract Data Type).
Regarding which data structure to use, you can use either.
For the set of nodes, a Hash table would be a good option (if you have a good C implementation for it). You will have amortized O(1) access to any node.
A LinkedList will take worst case O(n) time to find a node, and a balanced tree will be O(log n), which gives no advantage over the hash table unless for some reason you will sort the set of nodes a LOT (in which case an in-order traversal of a tree yields them sorted, in O(n) time).
Regarding adjacency lists for each node, it depends what you want to do with the graph.
If you will implement only, say, DFS and BFS, you need to iterate through all the neighbors of a specific node, so a LinkedList is the simplest way and it is sufficient.
But if you need to check whether a specific edge exists, it will take worst case O(n) time, because you need to iterate through the whole list (an adjacency-matrix implementation would make this operation O(1)).
A LinkedList of adjacent nodes can be sufficient, it depends on what you are going to do.
If you need to know what nodes are adjacent to one another, you could use an adjacency matrix. In other words, for a graph of n nodes, you have an n x n matrix whose entry for (i,j) is 1 if i and j are next to each other in the graph.