How do databases implement skipping?

I'm writing my own little database engine. Is there an efficient way to implement a skip function other than inspecting every leaf node of the B+ tree, which will be slow with a large number of entries?

If you are using a B+ tree for your indices, all the values are stored in the leaves, so the leaves can be linked together to form an ordered linked list, or rather an unrolled linked list. That's the main advantage of B+ trees over plain B-trees.
That said, even though unrolled lists let you do some form of skipping, nothing prevents you from implementing skip lists on your records and using the nodes of these lists as your B-tree values.
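As a rough illustration of the "some form of skipping" that linked leaves allow, the sketch below skips whole leaves by their entry counts instead of visiting every entry. The `Leaf` class and `skip_and_take` function are hypothetical names invented for this example, and the leaves are plain in-memory objects rather than disk pages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Leaf:
    """Hypothetical B+ tree leaf: sorted keys plus a link to the next leaf."""
    keys: List[int] = field(default_factory=list)
    next: Optional["Leaf"] = None


def skip_and_take(first_leaf, offset, limit):
    """Return up to `limit` keys after skipping `offset` keys in key order."""
    leaf = first_leaf
    # Skip whole leaves using only their entry counts.
    while leaf is not None and offset >= len(leaf.keys):
        offset -= len(leaf.keys)
        leaf = leaf.next
    result = []
    # Collect keys, starting partway into the first remaining leaf.
    while leaf is not None and len(result) < limit:
        remaining = limit - len(result)
        result.extend(leaf.keys[offset:offset + remaining])
        offset = 0
        leaf = leaf.next
    return result


# Three linked leaves holding keys 1..9, three keys per leaf.
c = Leaf([7, 8, 9])
b = Leaf([4, 5, 6], c)
a = Leaf([1, 2, 3], b)
page = skip_and_take(a, offset=4, limit=3)
```

This still walks the leaf chain, so the cost grows with the number of leaves skipped rather than the number of entries.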

Two years late, but anyway.
You can also do it the way Cassandra does. There is no offset; instead you specify the last key from the previous query, e.g.
select * from abc where key > 123 limit 100
where 123 is the last key returned by the previous query.
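A minimal sketch of this "last key" paging pattern, using Python's built-in `sqlite3` module as a stand-in rather than Cassandra itself. The table `abc` and column `key` come from the query above; the `payload` column, the `next_page` helper, and the starting value `0` (which assumes all keys are positive) are assumptions made for the example. An `ORDER BY` is added so that the page order is well defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE abc (key INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO abc VALUES (?, ?)",
    [(k, f"row{k}") for k in range(1, 251)],
)


def next_page(conn, last_key, page_size=100):
    """Fetch the rows whose key follows `last_key`, at most `page_size` of them."""
    return conn.execute(
        "SELECT key, payload FROM abc WHERE key > ? ORDER BY key LIMIT ?",
        (last_key, page_size),
    ).fetchall()


first = next_page(conn, last_key=0)               # assumes keys are positive
second = next_page(conn, last_key=first[-1][0])   # resume after the last key seen
```

Because the index on `key` can seek directly to the first row after `last_key`, no earlier rows have to be counted or discarded.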

Related

Optimal tree data structure for merging real-time sequential data

I am working on a project in C where I have to store, sort and update the information obtained in real time. The maximum amount of information I can have is defined. The information obtained is <key,value1,value2>, but is sorted according to the key and value1. The key would indicate the start of the node and value1 would indicate its size.
The basic operations I need to perform are insertion, search, and deletion, but most importantly merging whenever I find sequential (adjacent) information.
As an example, in an empty structure, I input <100,1> - this will create one node.
Next if I input <102,2> - this will create another node. So 2 nodes would exist in the tree.
Next if I input <101,1> - this should at the end create only one node in the tree: <100,4>
I also want to separately sort these nodes according to their value2. Note that value2 could be updated in real-time.
I was thinking about a B+ tree because of its logarithmic performance in all cases, since all leaf nodes are at the same level. And by using a separate doubly linked list, I can create separate links to sort the nodes according to value2.
But the overhead of keeping the nodes sorted by <key,value1> is that I always have to do a search operation: for merge, for add, and for delete.
Any thoughts/suggestions about this?
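The merge behaviour in the example above can be sketched as follows. This is only an illustration of the neighbour check, not a recommendation of a data structure: it keeps the ranges in sorted Python lists (so insertion is linear, not logarithmic), ignores value2, and assumes incoming ranges never overlap existing ones. The `RangeSet` name and its methods are hypothetical.

```python
import bisect


class RangeSet:
    """Sorted, non-overlapping ranges kept as parallel lists of starts and sizes."""

    def __init__(self):
        self.starts = []
        self.sizes = []

    def insert(self, start, size):
        i = bisect.bisect_left(self.starts, start)
        # Merge with the predecessor if it ends exactly where the new range starts.
        if i > 0 and self.starts[i - 1] + self.sizes[i - 1] == start:
            i -= 1
            start = self.starts[i]
            size += self.sizes[i]
            del self.starts[i]
            del self.sizes[i]
        # Merge with the successor if the new range ends exactly where it starts.
        if i < len(self.starts) and start + size == self.starts[i]:
            size += self.sizes[i]
            del self.starts[i]
            del self.sizes[i]
        self.starts.insert(i, start)
        self.sizes.insert(i, size)

    def ranges(self):
        return list(zip(self.starts, self.sizes))


rs = RangeSet()
rs.insert(100, 1)
rs.insert(102, 2)
two_nodes = rs.ranges()   # the two separate nodes from the example
rs.insert(101, 1)
merged = rs.ranges()      # a single node after the gap is filled
```

Whatever ordered structure is used, the same idea applies: one search locates the insertion point, and only the immediate predecessor and successor need to be examined for a merge.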

What are the advantages of B plus tree over skip list?

Recently I have been learning about database systems from the CMU lectures published on YouTube. The professor said a skip list is like a rotated B+ tree, and it seems that the time complexity of operations on a B+ tree and a skip list is nearly the same. Plus, a skip list does not need to store intermediate nodes like a B+ tree does. However, few DBMSs use skip lists to store table indexes. So what are the advantages of a B+ tree over a skip list?

C Find a subset of strings from a list

I have a list of 1M entries and I want to exclude a subset of 20,000 of these entries (the two lists are in a different order but have the same key, a string). Can anyone suggest a quick search algorithm in C to do this?
I don't want to have to read each of the 20K IDs and look through the list of 1M every time. Any suggestions would be most helpful.
Thank you.
What you want to use is a hash set. A hash set is a special case of a hash table that basically records if an element exists in the set or not, in constant time. So, what you would do is insert your 20k IDs into the hash set, and then run through the 1 million strings and see if they exist in the hash set.
For your reference, here's an implementation of a hash set in C: https://github.com/avsej/hashset.c
Your running time would be O(n), since each of the 1M membership checks takes constant time on average.
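The idea is language-independent; here is a minimal sketch in Python, whose built-in `set` is a hash set. The function name and the sample IDs are made up for illustration.

```python
def exclude_ids(all_ids, ids_to_exclude):
    """Return the entries of `all_ids` that do not appear in `ids_to_exclude`."""
    excluded = set(ids_to_exclude)   # build the hash set once
    # One average constant-time membership test per entry of the big list.
    return [entry for entry in all_ids if entry not in excluded]


all_ids = ["id7", "id2", "id9", "id4", "id1"]
kept = exclude_ids(all_ids, ["id4", "id7"])
```

The original order of the large list is preserved, and neither list needs to be sorted.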
Sort both lists first. Then you can traverse them together, advancing the pointer in the list that is behind the pointer in the other.
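A sketch of this sort-and-traverse approach, written in Python for brevity. The function name is hypothetical; note that the result comes back in sorted order rather than the original order.

```python
def exclude_sorted(all_ids, ids_to_exclude):
    """Drop excluded entries by walking two sorted lists in step."""
    a = sorted(all_ids)
    b = sorted(ids_to_exclude)
    kept = []
    j = 0
    for item in a:
        # Advance the pointer into the exclusion list while it is behind.
        while j < len(b) and b[j] < item:
            j += 1
        if j < len(b) and b[j] == item:
            continue   # item is in the exclusion list
        kept.append(item)
    return kept


result = exclude_sorted(["d", "a", "c", "b"], ["c", "a"])
```

The traversal itself is linear; the two sorts dominate the running time.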
Do you have to use C? This sounds like a job for Perl.
Put the 20,000 keys to be excluded in a hash table before you start searching the list. Then, for each item's key in the list, if that key is found in the hash table, exclude the item from the list.

how to remember multiple indexes in a buffer to later access them for modification one by one...keeping optimization in mind

I have a scenario where I have to set a field in a few records to a constant value and then later access those records one by one, sequentially.
The records can be at random positions.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some ideas for doing that.
When you say "set few records with field values to a constant", is this like a key to the record? And then "later access them one by one": is this to recall them with some key? "One-by-one sequentially" and "don't want to traverse the whole buffer" seem to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
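A minimal sketch of that basic implementation, in Python. It uses an array of small Python lists as the chains instead of true linked lists, and the `ChainedTable` name, bucket count, and sample records are assumptions made for the example.

```python
class ChainedTable:
    """Hash table with chaining: integer keys are reduced with modulo."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # "Mod the key down into the array's range."
        return self.buckets[key % len(self.buckets)]

    def put(self, key, record):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, record)   # overwrite an existing entry
                return
        bucket.append((key, record))

    def get(self, key):
        for existing_key, record in self._bucket(key):
            if existing_key == key:
                return record
        return None


table = ChainedTable(num_buckets=4)
table.put(1, "record A")
table.put(5, "record B")   # 5 % 4 == 1, so this shares a bucket with key 1
```

Lookups only scan the one chain the key maps to, which is why a good spread of keys across the array matters.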
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

C Database Design, Sortable by Multiple Fields

If memory is not an issue for my particular application (entry, lookup, and sort speed being the priorities), what kind of data structure/concept would be the best option for a multi-field rankings table?
For example, let's say I want to create a Hall of Fame for a game, sortable by top score (independent of username), username (with all scores by the same user placed together before ranking users by their highest scores), or level reached (independent of score or name). In this example, if I order a linked list, vector, or any other sequential data structure by the top score of each player, it makes searching for the other fields -- like level and non-highest scores -- more iterative (i.e. iterate across all looking for the stage, or looking for a specific score-range), unless I conceive some other way to store the information sorted when I enter new data.
The question is whether there is a more efficient (albeit complicated and memory-consumptive) method or database structure in C/C++ that might be primed for this kind of multi-field sort. Linked lists seem fine for simple score rankings, and I could even organize a hashtable by hashing on a single field (player name, or level reached) to sort by a single field, but then the other fields take O(N) to find, worse to sort. With just three fields, I wonder if there is a way (like sets or secondary lists) to prevent iterating in certain pre-desired sorts that we know beforehand.
Do it the same way databases do it: using index structures. You have your main data as a number of records (structs), perhaps ordered according to one of your sorting criteria. Then you have index structures, each one ordered according to one of your other sorting criteria, but these index structures don't contain copies of all the data, just pointers to the main data records. (Think "index" like the index in a book, with page numbers "pointing" into the main data body.)
Using an ordered linked list for your index structures will give you a fast and simple way to go through the records in order, but it will be slow if you need to search for a given value, and similarly slow when inserting new data.
Hash tables will have fast search and insertion, but (with normal hash tables) won't help you with ordering at all.
So I suggest some sort of tree structure. Balanced binary trees (look for AVL trees) work well in main memory.
But don't forget the option to use an actual database! Database managers such as MySQL and SQLite can be linked with your program, without a separate server, and let you do all your sorting and indexing very easily, using SQL embedded in your program. It will probably execute a bit slower than if you hand-craft your own main-memory data structures, or if you use main-memory data structures from a library, but it might be easier to code, and you won't need to write separate code to save the data on disk.
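To make the index-structure idea concrete, here is a small sketch in Python. The main data lives in one list, and each index holds only a sort key plus the record's position. For brevity the indexes are sorted lists maintained with `bisect` rather than the balanced trees suggested above, so insertion here is linear; all names and sample data are hypothetical.

```python
import bisect

records = []    # main data; a record's position in this list is its id
by_score = []   # index entries: (-score, record_id), highest score first
by_level = []   # index entries: (-level, record_id), highest level first


def add_record(name, score, level):
    """Store the record once and register it in every index."""
    record_id = len(records)
    records.append({"name": name, "score": score, "level": level})
    bisect.insort(by_score, (-score, record_id))
    bisect.insort(by_level, (-level, record_id))
    return record_id


def top_by(index, n):
    """Follow the first n index entries back to the main records."""
    return [records[record_id] for _, record_id in index[:n]]


add_record("ann", 300, 2)
add_record("bob", 500, 1)
add_record("cy", 400, 5)
best_scores = [r["name"] for r in top_by(by_score, 2)]
best_level = top_by(by_level, 1)[0]["name"]
```

Each additional sort order costs one more index of small entries, not another copy of the data.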
So, you already know how to store your data and keep it sorted with respect to a single field. Assuming the values of the fields for a single entry are independent, the only way you'll be able to get what you want is to keep three different lists (using the data structure of your choice), each of which is sorted by a different field. You'll use three times as much memory for pointers as a single list would.
As for what data structure each of the lists should be, a binary max heap can be effective. Insertion is O(log N), and reading the top entry is O(1). Listing every entry in order means popping them one at a time at O(log N) each, so O(N log N) in total. If in some of these list copies the entries need to be sub-sorted by another field, just account for that in the comparison function.
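A small sketch of the heap suggestion using Python's `heapq` module. `heapq` implements a min-heap, so scores are negated to get max-heap behaviour; the helper names and sample data are made up. Only the top entry can be read without removing anything, so the in-order listing below pops from a copy.

```python
import heapq

score_heap = []   # entries are (-score, name) so the largest score is on top


def add_entry(heap, name, score):
    heapq.heappush(heap, (-score, name))


def best(heap):
    """Peek at the top entry without removing it."""
    neg_score, name = heap[0]
    return name, -neg_score


def in_order(heap):
    """List all entries from highest to lowest by popping from a copy."""
    work = list(heap)   # a shallow copy of a heap list is still a valid heap
    ordered = []
    while work:
        neg_score, name = heapq.heappop(work)
        ordered.append((name, -neg_score))
    return ordered


add_entry(score_heap, "ann", 300)
add_entry(score_heap, "bob", 500)
add_entry(score_heap, "cy", 400)
top_entry = best(score_heap)
ranking = in_order(score_heap)
```

Sub-sorting by a second field amounts to extending the tuple, for example `(-score, level, name)`.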
