Optimal tree data structure for merging real-time sequential data - c

I am working on a project in C where I have to store, sort and update the information obtained in real time. The maximum amount of information I can have is defined. The information obtained is <key,value1,value2>, but is sorted according to the key and value1. The key would indicate the start of the node and value1 would indicate its size.
The basic operations I would need to perform here are insertion, search, and deletion, but most importantly merge when I find sequential information.
As an example, in an empty structure, I input <100,1> - this will create one node.
Next if I input <102,2> - this will create another node. So 2 nodes would exist in the tree.
Next if I input <101,1> - this should at the end create only one node in the tree: <100,4>
I also want to separately sort these nodes according to their value2. Note that value2 could be updated in real-time.
I was thinking about a B+ tree because of its logarithmic performance in all cases, since all leaf nodes are at the same level. And by using a separate doubly linked list, I can create separate links to sort the nodes according to value2.
But the overhead of sorting according to <key,value1> would be that I always have to do a search operation: for merge, for add and for delete.
Any thoughts/suggestions about this?
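For what it's worth, the merge rule from the example (<100,1>, <102,2>, then <101,1> collapsing to <100,4>) can be sketched independently of the container. The following is only a minimal sketch, assuming a sorted fixed-capacity array stands in for the tree (the question says the maximum amount is defined) and that inputs never overlap existing ranges. RangeSet, range_insert, range_upper_bound and MAX_RANGES are made-up names, and the value2 ordering is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_RANGES 64  /* hypothetical capacity; the question only says it is bounded */

typedef struct {
    unsigned start;  /* the key */
    unsigned size;   /* value1 */
} Range;

typedef struct {
    Range r[MAX_RANGES];
    size_t count;
} RangeSet;

/* Index of the first range whose start is greater than key (binary search). */
size_t range_upper_bound(const RangeSet *s, unsigned key)
{
    size_t lo = 0, hi = s->count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (s->r[mid].start <= key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* Insert <start,size>, merging with a predecessor and/or successor that is
 * exactly adjacent. Returns 0 on success, -1 if the set is full.
 * Assumption: the new range does not overlap an existing one. */
int range_insert(RangeSet *s, unsigned start, unsigned size)
{
    assert(s && size > 0);
    size_t i = range_upper_bound(s, start);   /* position of the successor */
    int joins_prev = i > 0 && s->r[i - 1].start + s->r[i - 1].size == start;
    int joins_next = i < s->count && start + size == s->r[i].start;

    if (joins_prev && joins_next) {
        /* The new range bridges two nodes: grow the predecessor, drop the successor. */
        s->r[i - 1].size += size + s->r[i].size;
        memmove(&s->r[i], &s->r[i + 1], (s->count - i - 1) * sizeof(Range));
        s->count--;
    } else if (joins_prev) {
        s->r[i - 1].size += size;
    } else if (joins_next) {
        s->r[i].start = start;
        s->r[i].size += size;
    } else {
        if (s->count == MAX_RANGES)
            return -1;
        memmove(&s->r[i + 1], &s->r[i], (s->count - i) * sizeof(Range));
        s->r[i].start = start;
        s->r[i].size = size;
        s->count++;
    }
    return 0;
}
```

The same two neighbour checks would apply in a B+ tree or balanced binary tree: one search finds the insertion point, and only the predecessor and successor need to be examined for a merge.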

Related

Why is deletion and insertion in Link list faster than arraylist?

I am trying to understand why insertion and deletion in a LinkedList is O(1) rather than O(N) like in ArrayLists. The common explanation is that because a LinkedList is formed from a doubly linked list, you simply have to change the references. But don't you still need to find the place where you are inserting or deleting to? Do you not traverse the LL to reach the address in question before you can even change the next and previous references making it a O(N) time?
But don't you still need to find the place where you are inserting or deleting to? Do you not traverse the LL to reach the address in question before you can even change the next and previous references making it a O(N) time?
Not always. In many cases, a data structure can hold a reference (or pointer/iterator) directly to a node of a list. This reference helps algorithms quickly delete the node later (in O(1), without any traversal). For example, this can be the case on a network server handling a list of clients: a data structure containing contextual information about a client can reference a node in order to quickly delete the client from the list once it disconnects (so as to scale well with the number of active clients). In this case, no full traversal is needed.
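To make the "hold a pointer to the node" idea concrete, here is a minimal sketch in C of a doubly linked list where removal takes the node itself rather than a value to search for. The names (Node, List, list_unlink, client_id) are invented for illustration, and the nodes are assumed to be owned by the caller.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Node {
    struct Node *prev, *next;
    int client_id;   /* hypothetical payload */
} Node;

typedef struct {
    Node *head, *tail;
} List;

/* Append a caller-owned node at the tail in O(1). */
void list_push_back(List *l, Node *n)
{
    assert(l && n);
    n->prev = l->tail;
    n->next = NULL;
    if (l->tail)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
}

/* Unlink a node we already hold a pointer to: O(1), no traversal. */
void list_unlink(List *l, Node *n)
{
    assert(l && n);
    if (n->prev)
        n->prev->next = n->next;
    else
        l->head = n->next;
    if (n->next)
        n->next->prev = n->prev;
    else
        l->tail = n->prev;
    n->prev = n->next = NULL;
}
```

A per-client context struct could embed or point to its Node, so that a disconnect handler calls list_unlink directly instead of walking the list.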

How do databases implement skipping?

I'm writing my own little database engine. Is there any efficient way to implement a skipping function other than inspecting every leaf node of the B+tree, which will be slow with a large number of entries?
If you are using a B+tree for your indices, all the values are stored in leaves and thus can be linked together to form an (ordered) linked list, or rather an unrolled linked list. That's the main advantage of B+ tree over plain B trees.
That said, even if unrolled lists let you do some form of skipping, nothing prevents you from implementing skip lists on your records, and using the nodes of these lists as your btree values.
2 years after, but anyway.
You can do it in Cassandra's way too. No limit, but you specify the last key from previous query, e.g.
select * from abc where key > 123 limit 100
where 123 is last key from previous query
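The "last key from the previous query" approach can be sketched in C as follows. This is only an illustration, assuming a sorted in-memory array of integer keys stands in for the ordered leaf level of the index; next_page is a made-up name. The point is that the start of the page is found by a logarithmic search on the key rather than by counting past skipped rows.

```c
#include <assert.h>
#include <stddef.h>

/* Copy up to `limit` keys strictly greater than `last_key` from a sorted
 * array into `out`; returns how many keys were copied. */
size_t next_page(const int *keys, size_t n, int last_key,
                 int *out, size_t limit)
{
    assert(keys && out);
    size_t lo = 0, hi = n;
    while (lo < hi) {                 /* binary search: first key > last_key */
        size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] <= last_key)
            lo = mid + 1;
        else
            hi = mid;
    }
    size_t got = 0;
    while (lo < n && got < limit)     /* sequential read of one page */
        out[got++] = keys[lo++];
    return got;
}
```

The caller remembers the last key of each page and passes it back in for the next one. In a real B+tree the binary search would be a root-to-leaf descent and the copy loop would follow the leaf links.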

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), username (with all scores by the same user placed together before ranking users by their highest scores), or level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by the top score of each player, it makes searching for the other fields -- like level and non-highest scores -- more iterative (i.e. iterate across all looking for the stage, or looking for a specific score-range), unless I conceive some other way to store the information sorted when I enter new data.
The question is whether there is a more efficient (albeit complicated and memory-consumptive) method or database structure in C/C++ that might be primed for this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hashtable by hashing on a single field (player name, or level reached) to sort by a single field, but then the other fields take O(N) to find, worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to prevent iterating in certain pre-desired sorts that we know beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
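As a rough illustration of the index idea, the sketch below keeps one array of records and builds separate arrays of pointers, each sorted by a different criterion with the standard qsort. The Record layout and the function names are assumptions made up for the example, and "name, then highest score first" is a simplification of the username ordering described in the question. A sorted pointer array is used only to keep the sketch short; it still has slow insertion compared with a tree.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char name[16];
    int score;
    int level;
} Record;   /* hypothetical Hall of Fame entry */

/* qsort hands the comparators pointers to index slots, i.e. Record *const *. */
int by_score_desc(const void *a, const void *b)
{
    const Record *x = *(Record *const *)a;
    const Record *y = *(Record *const *)b;
    return (x->score < y->score) - (x->score > y->score);
}

int by_name_then_score(const void *a, const void *b)
{
    const Record *x = *(Record *const *)a;
    const Record *y = *(Record *const *)b;
    int c = strcmp(x->name, y->name);
    if (c != 0)
        return c;
    /* Same user: sub-sort with the highest score first. */
    return (x->score < y->score) - (x->score > y->score);
}

/* Fill idx[] with pointers into recs[] and sort only the pointers;
 * the records themselves are never copied or moved. */
void build_index(Record *recs, size_t n, Record **idx,
                 int (*cmp)(const void *, const void *))
{
    assert(recs && idx && cmp);
    for (size_t i = 0; i < n; i++)
        idx[i] = &recs[i];
    qsort(idx, n, sizeof idx[0], cmp);
}
```

Each extra sort order costs one pointer per record, which matches the "index in a book" picture: several indexes, one body of data.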
Using an ordered linked list for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields for a single entry are independent, the only way you'll be able to get what you want is to keep three different lists (using the data structure of your choice), each of which is sorted by a different field. You'll use three times the memory's worth of pointers of a single list.
As for what data structure each of the lists should be, a binary max heap can be effective. Insertion is O(lg N), and reading the top entry is O(1); extracting entries in order costs O(lg N) each, so O(N lg N) to see all of them. If in some of these list copies the entries need to be sub-sorted by another field, just consider that in the comparison function call.

How is lazy deletion advantageous/disadvantageous to a binary tree or linked list?

Recently, for a data structures class, I was asked the question of how a lazy deletion (that is, a deletion that first marks the items needed to be deleted, and then at some time later deletes all the marked items) would be advantageous/disadvantageous to an array, linked list, or binary tree. Here is what I have come up with:
This would help arrays because you would save the time that is taken to shift the array every time an index is deleted although in algorithms that need to traverse the array, there could be inefficiencies.
This would not help linked lists because you need to traverse O(n) to mark an item to be deleted anyways.
I'm not entirely sure about binary trees, but if it were a linked list implementation of one I would imagine it is like the linked list?
I think it all depends on the circumstances and the requirements. In a general sense using this method where they are marked and then deleted later they all have a lot of similar pros and cons.
Similar pros:
-When marked for deletion there is no shifting of the data structures, which makes deletion much faster.
-You can insert on top of a deleted item, which means no shifting for inserts either, plus insertion can be done faster since it can write over the first delete instead of looking for the end of a list
Similar cons:
-Waste of space for deleted items, since they are just sitting there
-Have to traverse twice to delete an item, once to mark it and once again to delete it
-Many marked items for deletion will pollute the data structure making searches take longer since deleted items have to be searched over.
Perhaps you need to think a little deeper on the implementation of linked lists. You indicate that lazy deletion would NOT help in any way because the time to search is all the time that is necessary to perform the delete.
Think about what it takes to actually REMOVE an item from a linked list.
note: this assumes a SINGLY linked list (not a double linked list)
1) Find the item to delete (because this is a SINGLE linked list, you always have to search, because you need the PREV item)
2) Keep pointers to the PREV and NEXT elements
3) Fix the "PREV" element to point to the NEXT element - thus, isolating the CURRENT element
3.5) in a double linked list, you also have to take care of the NEXT element pointing back to the PREV element.
4) Release the memory associated with CURRENT element
Now, what is the process for a lazy delete? --- much shorter
1) Find the item to delete (you may not even have to perform a search, as you already have a pointer to the object you want deleted?)
2) mark the item for deletion.
*) Wait for "Garbage collection" thread to run and actually perform the remaining steps WHEN the system is "IDLE"
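The two procedures above can be sketched in C for a singly linked list. This is a minimal sketch under assumptions: the names (LNode, lazy_mark, lazy_sweep) are invented, and the "garbage collection" step is just a function the program calls when it is idle rather than a real background thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct LNode {
    struct LNode *next;
    int value;
    bool deleted;    /* the lazy-deletion mark */
} LNode;

/* Push a new node at the head; returns the new head
 * (or the unchanged head if allocation fails). */
LNode *lazy_push(LNode *head, int value)
{
    LNode *n = malloc(sizeof *n);
    if (!n)
        return head;
    n->next = head;
    n->value = value;
    n->deleted = false;
    return n;
}

/* Lazy delete: O(1) when the caller already holds the node. */
void lazy_mark(LNode *n)
{
    assert(n);
    n->deleted = true;
}

/* Deferred sweep: one pass unlinks and frees every marked node.
 * Returns the number of nodes freed. */
size_t lazy_sweep(LNode **head)
{
    assert(head);
    size_t freed = 0;
    LNode **link = head;           /* the pointer that leads to the current node */
    while (*link) {
        LNode *cur = *link;
        if (cur->deleted) {
            *link = cur->next;     /* make PREV (or head) skip CURRENT */
            free(cur);
            freed++;
        } else {
            link = &cur->next;
        }
    }
    return freed;
}
```

Searches and traversals have to skip nodes whose deleted flag is set until the sweep runs, which is the "pollution" cost mentioned in the cons list.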
A binary tree can be implemented like a linked list where each element has a left and a right pointer; however, you still perform the same steps in the search. Binary tree searching is just more efficient, with O(log n) for a balanced tree, I believe.
However, deleting from these becomes more complex because you have more pointers to deal with (both a "LEFT" and a "RIGHT"), so it will take more instructions to fix, especially when you are deleting a tree node that has pointers to nodes for both the left and right. One of them will need to be promoted to take the deleted node's place, but what if they also both already have their left and right pointers assigned? Where does the original "left/right" node go? You have to re-balance the tree at this point. Thus, there is a significant saving from a user perspective in marking for deletion and having an "idle" garbage collection take care of the memory details (so the user doesn't have to wait for that).
I would think the answer would be "it depends" for various reasons, but I think you're on the right track.
1) I agree with your answer about arrays, assuming that there is a requirement that there are no holes in your array. If you are not required to shift the array around on every delete, then the proposed mark now, delete later approach wouldn't help at all. Either way, you are dealing with O(n) vs. (2O(n) = O(n)) for the algorithm, which are equal. The real thing to think about is "does reordering all of the deletions at one time vs. reordering each of the deletions individually save you any time?" Assuming m is the number of deletions, the number of times that each element after the first deletion in your array is reordered is O(m) for the delete immediately approach as compared to O(1) for the mark now, delete later approach.
2) I agree with your answer for linked lists.
3) As for Binary Tree, I suppose it would depend what kind it is. If you are using a sorted, balanced binary tree, you'll have to consider the same question as in 1) above, but if not, your thoughts are correct, and it should behave exactly like a linked list.
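To illustrate point 1) about arrays: with mark now, delete later, all pending deletions can be applied in a single compaction pass, so each surviving element is moved at most once no matter how many deletions were batched. A minimal sketch in C, assuming the marks are kept in a parallel bool array (compact and marked are made-up names):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Remove every element whose mark is set, preserving order.
 * Each surviving element is written at most once.
 * Returns the new length of the array. */
size_t compact(int *a, bool *marked, size_t n)
{
    assert(a && marked);
    size_t w = 0;                      /* next write position */
    for (size_t r = 0; r < n; r++) {   /* read position */
        if (!marked[r]) {
            a[w] = a[r];
            marked[w] = false;         /* the slot now holds a live element */
            w++;
        }
    }
    return w;
}
```

Deleting immediately would instead shift the tail of the array once per deletion, which is where the O(m) versus O(1) moves per element comes from.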

C Directed Graph Implementation Choice

Welcome mon amie,
In some homework of mine, I feel the need to use the Graph ADT. However, I'd like to have it, how do I say, generic. That is to say, I want to store in it whatever I fancy.
The issue I'm facing has to do with complexity. What data structure should I use to represent the set of nodes? I forgot to say that I already decided to use the adjacency list technique.
Generally, textbooks mention a linked list, but, it is to my understanding that whenever a linked list is useful and we need to perform searches, a tree is better.
But then again, what we need is to associate a node with its list of adjacent nodes, so what about a hash table?
Can you help me decide in which data structure (linked list, tree, hash table) I should store the nodes?
...the Graph ADT. However, I'd like to have it, how do I say, generic. That is to say, I want to store in it whatever I fancy.
That's basically the point of an ADT (Abstract Data Type).
Regarding which data structure to use, you can use either.
For the set of nodes, a Hash table would be a good option (if you have a good C implementation for it). You will have amortized O(1) access to any node.
A LinkedList will take worst case O(n) time to find a node. A balanced tree will be O(log n) and won't give you any advantage over the hash table unless for some reason you need the set of nodes in sorted order a LOT (in which case an in-order traversal of the tree yields them sorted in O(n) time).
Regarding adjacency lists for each node, it depends what you want to do with the graph.
If you will implement only, say, DFS and BFS, you need to iterate through all neighbors of a specific node, so LinkedList is the simplest way and it is sufficient.
But, if you need to check if a specific edge exists, it will take you worst case O(n) time because you need to iterate through the whole list (An Adjacency Matrix implementation would make this op O(1))
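A generic adjacency-list graph along these lines could look like the sketch below. It rests on one simplifying assumption that is not in the question: nodes are numbered 0..n-1, so a plain array gives the O(1) node lookup that the hash table would otherwise provide. The payload is a void pointer so the caller can store whatever it likes. All names (Graph, Vertex, Edge, graph_has_edge, ...) are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct Edge {
    size_t to;
    struct Edge *next;
} Edge;

typedef struct {
    void *data;      /* generic payload supplied by the caller */
    Edge *adj;       /* head of this node's adjacency list */
} Vertex;

typedef struct {
    Vertex *v;
    size_t n;
} Graph;

/* Allocate n vertices with no edges; returns false on allocation failure. */
bool graph_init(Graph *g, size_t n)
{
    assert(g && n > 0);
    g->v = malloc(n * sizeof *g->v);
    if (!g->v) {
        g->n = 0;
        return false;
    }
    for (size_t i = 0; i < n; i++) {
        g->v[i].data = NULL;
        g->v[i].adj = NULL;
    }
    g->n = n;
    return true;
}

/* Add a directed edge from -> to in O(1) by pushing onto the list head. */
bool graph_add_edge(Graph *g, size_t from, size_t to)
{
    assert(g && from < g->n && to < g->n);
    Edge *e = malloc(sizeof *e);
    if (!e)
        return false;
    e->to = to;
    e->next = g->v[from].adj;
    g->v[from].adj = e;
    return true;
}

/* Edge test: walks the neighbour list of `from`, so it is linear in its out-degree. */
bool graph_has_edge(const Graph *g, size_t from, size_t to)
{
    assert(g && from < g->n);
    for (const Edge *e = g->v[from].adj; e; e = e->next)
        if (e->to == to)
            return true;
    return false;
}

/* Free all edges and the vertex array; payloads stay owned by the caller. */
void graph_free(Graph *g)
{
    for (size_t i = 0; i < g->n; i++) {
        Edge *e = g->v[i].adj;
        while (e) {
            Edge *next = e->next;
            free(e);
            e = next;
        }
    }
    free(g->v);
    g->v = NULL;
    g->n = 0;
}
```

DFS and BFS only need the loop shown in graph_has_edge to visit neighbours, which is why the linked list is usually sufficient for them.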
A LinkedList of adjacent nodes can be sufficient, it depends on what you are going to do.
If you need to know what nodes are adjacent to one another, you could use an adjacency matrix. In other words, for a graph of n nodes, you have an n x n matrix whose entry for (i,j) is 1 if i and j are next to each other in the graph.
