Why are there two leaf nodes in this B-Tree lookup? - database

In this graphic, we are looking up employee_id 123 and subsidiary_id 20 in a B-Tree (from a tutorial on database indexes). There are two leaf nodes branching off the tree. Is this purely for demonstration, or is there something I'm missing? I would think that the only leaf node that needs to be checked is the top one, as it has employee_id max 123 and subsidiary_id max 27.

The diagram itself isn't showing the actions of that particular search, rather it's showing a localised portion of the tree, so I wouldn't read too much into it for the specific query.
You're absolutely correct in that searching for the key 123-20, you never need to follow the link to the second leaf node (either through the hierarchical link from the left, or the sequential link from above).
However (and this is hard to tell without seeing the source material), it's quite likely that this diagram may be used for something else as well.
The fact that it shows links between consecutive leaf nodes means that it would be quite easy to use the index lookup to locate a specific entry, then sequentially process them.
By that I mean a query like "give me every record with an employee ID of 123", or "give me all records with an employee ID between 123 and 456", or "give me all records in employeeID/subsidiaryID order".
All those queries would entail finding a specific record using the hierarchical path (though the final one may have a quicker path direct to the first record), then following the sequential path for subsequent records.
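To make the "find one entry, then follow the sequential links" idea concrete, here is a minimal sketch. It assumes leaves hold sorted (employee_id, subsidiary_id) keys and a pointer to the next leaf; the `Leaf` class, the helper names and the sample keys in the second leaf are made up for illustration, and a simple binary search over the leaves' largest keys stands in for the real descent through branch nodes.

```python
from bisect import bisect_left

class Leaf:
    """One leaf node: sorted (employee_id, subsidiary_id) keys plus a link to the next leaf."""
    def __init__(self, keys):
        self.keys = sorted(keys)
        self.next = None

def build_leaves(groups):
    """Create leaves from already-ordered groups of keys and chain them together."""
    leaves = [Leaf(g) for g in groups]
    for left, right in zip(leaves, leaves[1:]):
        left.next = right
    return leaves

def range_scan(leaves, low, high):
    """Yield every key k with low <= k <= high."""
    # Stand-in for the hierarchical descent: pick the first leaf whose
    # largest key is >= low, as a branch node entry would.
    maxima = [leaf.keys[-1] for leaf in leaves]
    i = bisect_left(maxima, low)
    leaf = leaves[i] if i < len(leaves) else None
    while leaf is not None:
        for key in leaf.keys:
            if key > high:
                return          # past the range, stop scanning
            if key >= low:
                yield key
        leaf = leaf.next        # sequential link to the neighbouring leaf

# Hypothetical data: the first leaf ends at (123, 27) as in the question.
leaves = build_leaves([
    [(123, 18), (123, 20), (123, 27)],
    [(123, 30), (125, 20), (126, 20)],
])

exact = list(range_scan(leaves, (123, 20), (123, 20)))
all_123 = list(range_scan(leaves, (123, 0), (123, float("inf"))))
```

With this data the exact lookup stops inside the first leaf, while the "every record for employee 123" scan has to follow the link into the second leaf.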
In addition, the fact that the subsidiary IDs of 20 are both red means that it would be an ideal opportunity to educate the reader on the fact that the employee-subsidiary index is not necessarily the best one for all queries. In other words, an efficient query "give me all records from subsidiary 20" would be better with another index (one containing simply the subsidiary ID).
That'd be my best guess; it would be worthwhile looking at the tutorial to see if that diagram is used for something else.
Of course, it could be that the person putting together the tutorial couldn't be bothered creating a new graphic so simply used one from a different question, or an earlier iteration of the tutorial :-) I've been guilty of that before.

Related

how to remember multiple indexes in a buffer to later access them for modification one by one...keeping optimization in mind

I have a scenario where I have to set a few records' field values to a constant and then later access those records one by one, sequentially.
The records can be random records.
I don't want to use a linked list, as it will be costly, and I don't want to traverse the whole buffer.
Please give me some idea of how to do that.
When you say "set few records with field values to a constant" is this like a key to the record? And then "later access them one by one" - is this to recall them with some key? "one-by-one sequentially" and "don't want to traverse the whole buffer" seems to conflict, as sequential access sounds a lot like traversal.
But I digress. If you in fact do have a key (and it's a number), you could use some sort of Hash Table to organize your records. One basic implementation might be an array of linked lists, where you mod the key down into the array's range, then add it to the list there. This might increase performance assuming you have a good distribution of keys (your records spread across the array well).
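A rough sketch of that basic implementation, assuming integer keys. The class name and bucket count are arbitrary, and Python lists stand in for the linked lists:

```python
class ChainedHashTable:
    """Array of buckets; each bucket is a list of (key, record) pairs."""

    def __init__(self, buckets=8):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        # "Mod the key down" into the array's range.
        return self.buckets[key % len(self.buckets)]

    def put(self, key, record):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, record)   # overwrite an existing key
                return
        bucket.append((key, record))

    def get(self, key):
        for k, record in self._bucket(key):
            if k == key:
                return record
        return None

table = ChainedHashTable()
table.put(17, "record A")
table.put(25, "record B")   # 25 % 8 == 17 % 8, so both share a bucket
```

A lookup only walks the one bucket the key maps to, which is why a good spread of keys matters.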
Another data structure to look into might be a B-Tree or a binary search tree, which can access nodes in logarithmic time.
However, overall I agree with the commenters that over-optimizing is usually not a good idea.

Implementing a database -- How to get started

I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt with coding a tangible software from ground up.
I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.
Can someone help me find out how to gather the knowledge I need for this kind of work and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or start from a wrong point, because implementing a complete database engine is (seeming to be) a truly complicated task.
I want to stress that I'd prefer theses, whitepapers and (e)books to the code of other projects, because when I've asked this kind of question before, the usual answer was "read project X's source code". I'm not at the level of comfortably reading and understanding source code.
First, you may have a look at the answers to How to write a simple database engine. While it focuses on a SQL engine, there is still a lot of good material in the answers.
Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.
Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.
Because the accepted answer only offers (good) links to other resources, I thought I'd share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code. It's pretty small. You should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this, and since writing it I have learned a lot more and can see I was doing some things wrong. It can help you get started though.
The basics: BTree
I started out with adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node, then all items with key < K in a node in the left subtree and all items with key > K in a right subtree. You can use an array to store the data items if you want to support non unique keys.
This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.
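As an illustration only (this is not the webdb code), here is a plain binary search tree with the operations described above. It leaves out the AVL rebalancing to stay short, so it can degenerate on sorted input; all names are made up for the sketch.

```python
class Node:
    def __init__(self, key, item):
        self.key = key
        self.items = [item]   # a list, so non-unique keys are supported
        self.left = None
        self.right = None

def insert(node, key, item):
    """Insert item under key and return the (possibly new) subtree root."""
    if node is None:
        return Node(key, item)
    if key < node.key:
        node.left = insert(node.left, key, item)
    elif key > node.key:
        node.right = insert(node.right, key, item)
    else:
        node.items.append(item)
    return node

def get(node, key):
    """Return all items stored under key (empty list if absent)."""
    while node is not None and node.key != key:
        node = node.left if key < node.key else node.right
    return node.items if node else []

def between(node, low, high):
    """Yield (key, item) pairs with low <= key <= high, in key order."""
    if node is None:
        return
    if low < node.key:
        yield from between(node.left, low, high)
    if low <= node.key <= high:
        for item in node.items:
            yield node.key, item
    if node.key < high:
        yield from between(node.right, low, high)

root = None
for key, item in [(50, "e"), (30, "c"), (70, "g"), (30, "c2"), (60, "f")]:
    root = insert(root, key, item)
```

The range query prunes subtrees that cannot contain matching keys, which is what makes "key between x and y" cheap on a balanced tree.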
A schema
As a next step I wrote code that lets the client code define a schema. Methods like createTable() etc. Schemas are typically associated with SQL, but even no-SQL sort-of has a schema; they usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as primary key and which fields will be searched on frequently and need an index.
Creating a data structure to store a table
I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).
Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store the (reference to the) actual object is because it makes set operations a little bit simpler (see next step)
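The two-step lookup can be sketched like this. It is a simplification under assumptions: dicts stand in for the trees, and the `Table` class and its method names are invented for the example rather than taken from webdb.

```python
from collections import defaultdict

class Table:
    def __init__(self, pk, indexed):
        self.pk = pk
        self.rows = {}   # primary key -> full record
        # One secondary index per field: value -> set of primary keys only.
        self.indexes = {field: defaultdict(set) for field in indexed}

    def insert(self, record):
        key = record[self.pk]
        self.rows[key] = record
        for field, index in self.indexes.items():
            index[record[field]].add(key)

    def find_by(self, field, value):
        # Step 1: secondary index gives the IDs.
        ids = self.indexes[field].get(value, set())
        # Step 2: primary index gives the actual records.
        return [self.rows[i] for i in sorted(ids)]

customers = Table(pk="id", indexed=["lastName"])
customers.insert({"id": 1, "lastName": "Smith"})
customers.insert({"id": 2, "lastName": "Jones"})
customers.insert({"id": 3, "lastName": "Smith"})
```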
Querying
Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.
In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:
    var webDb = ...
    var johnsFromNY = webDb.customers.get({
        firstName: 'John',
        city: 'New York'
    })
To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.
Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.
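In rough terms the AND and OR searches come down to the following. This is a sketch with made-up index contents and a hypothetical `ids_matching` helper, not the actual WebDB code:

```python
def ids_matching(indexes, criteria, mode="and"):
    """Combine the per-field ID sets with intersection (AND) or union (OR)."""
    id_sets = [indexes[field].get(value, set()) for field, value in criteria.items()]
    if not id_sets:
        return set()
    combine = set.intersection if mode == "and" else set.union
    return combine(*id_sets)

# Hypothetical index contents: field -> value -> set of customer IDs.
indexes = {
    "firstName": {"John": {1, 2, 4}, "Jane": {3}},
    "city": {"New York": {2, 3, 4}, "Boston": {1}},
}
rows = {
    1: {"id": 1, "firstName": "John", "city": "Boston"},
    2: {"id": 2, "firstName": "John", "city": "New York"},
    3: {"id": 3, "firstName": "Jane", "city": "New York"},
    4: {"id": 4, "firstName": "John", "city": "New York"},
}

criteria = {"firstName": "John", "city": "New York"}
johns_from_ny = [rows[i] for i in sorted(ids_matching(indexes, criteria))]
john_or_ny = ids_matching(indexes, criteria, mode="or")
```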
Doing joins would (I think) involve creating temporary tables in memory, then populating them as the query runs with the joined results, then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next but that was too ambitious and it went downhill from there :)

Determining the Similarity Between Items in a Database

We have a database with hundreds of millions of records of log data. We're attempting to 'group' this log data as being likely to be of the same nature as other entries in the log database. For instance:
Record X may contain a log entry like:
Change Transaction ABC123 Assigned To Server US91
And Record Y may contain a log entry like:
Change Transaction XYZ789 Assigned To Server GB47
To us humans those two log entries are easily recognizable as being likely related in some way. Now, there may be 10 million rows between Record X and Record Y. And there may be thousands of other entries that are similar to X and Y, and some that are totally different but that have other records they are similar to.
What I'm trying to determine is the best way to group the similar items together and say that with XX% certainty Record X and Record Y are probably of the same nature. Or perhaps a better way of saying it would be that the system would look at Record Y and say, based on your content, you're most like Record X as opposed to all other records.
I've seen some mentions of Natural Language Processing and other ways to find similarity between strings (like just brute-forcing some Levenshtein calculations) - however for us we have these two additional challenges:
The content is machine generated - not human generated
As opposed to a search engine approach where we determine results for a given query - we're trying to classify a giant repository and group them by how alike they are to one another.
Thanks for your input!
Interesting problem. Obviously, there's a scale issue here because you don't really want to start comparing each record to every other record in the DB. I believe I'd look at growing a list of "known types" and scoring records against the types in that list to see if each record has a match in that list.
The "scoring" part will hopefully draw some good answers here -- your ability to score against known types is key to getting this to work well, and I have a feeling you're in a better position than we are to get that right. Some sort of soundex match, maybe? Or if you can figure out how to "discover" which parts of new records change, you could define your known types as regular expressions.
At that point, for each record, you can hopefully determine that you've got a match (with high confidence) or a match (with lower confidence) or very likely no match at all. In this last case, it's likely that you've found a new "type" that should be added to your "known types" list. If you keep track of the score for each record you matched, you could also go back for low-scoring matches and see if a better match showed up later in your processing.
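A very rough sketch of that loop follows. The scoring function is only a placeholder (token-sequence similarity from Python's `difflib`), the 0.6 threshold is an arbitrary assumption, and each type is represented simply by the first record that created it.

```python
from difflib import SequenceMatcher

def score(a_tokens, b_tokens):
    """Placeholder similarity in [0, 1] between two token lists."""
    return SequenceMatcher(None, a_tokens, b_tokens, autojunk=False).ratio()

def assign_types(lines, threshold=0.6):
    known_types = []   # token list of the first record seen for each type
    assignments = []   # (type index, score) for every input line
    for line in lines:
        tokens = line.split()
        best_idx, best_score = None, 0.0
        for idx, representative in enumerate(known_types):
            s = score(tokens, representative)
            if s > best_score:
                best_idx, best_score = idx, s
        if best_idx is None or best_score < threshold:
            # Very likely no match at all: register a new known type.
            known_types.append(tokens)
            best_idx, best_score = len(known_types) - 1, 1.0
        assignments.append((best_idx, best_score))
    return known_types, assignments

known_types, assignments = assign_types([
    "Change Transaction ABC123 Assigned To Server US91",
    "User login failed for account 42",   # made-up unrelated entry
    "Change Transaction XYZ789 Assigned To Server GB47",
])
```

Keeping the score next to each assignment is what makes the later "revisit low-scoring matches" pass possible.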
I would suggest indexing your data using a text search engine like Lucene to split your log entries into terms. As your data is machine generated, also use word bigrams and trigrams, or even higher-order n-grams. A bigram is just a sequence of two consecutive words; in your example you would have the following bigrams:
Change_Transaction, Transaction_XYZ789, XYZ789_Assigned, Assigned_To, To_Server, Server_GB47
For each log entry, prepare queries in a similar way; the search engine may then give you the most similar results. You may need to tweak the similarity function a bit to obtain the best results, but I believe this is a good start.
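Generating the word n-grams yourself is straightforward. This small helper (the name and the underscore joiner are my own choices, not a Lucene API) reproduces the bigram list above:

```python
def word_ngrams(line, n=2):
    """Return the word n-grams of a log line, joined with underscores."""
    tokens = line.split()
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("Change Transaction XYZ789 Assigned To Server GB47")
trigrams = word_ngrams("Change Transaction XYZ789 Assigned To Server GB47", n=3)
```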
Two main strategies come to my mind here:
1. The ad-hoc one. Use an information retrieval approach. Build an index for the log entries, possibly using a specialized tokenizer/parser, by feeding them into a regular text search engine. I've heard people do this with Xapian and Lucene. Then you can "search" for a new log record and the text search engine will (hopefully) return some related log entries to compare it with. Usually, however, the "information retrieval" approach is only interested in finding the 10 most similar results.
2. The clustering approach. You will usually need to turn the data into numerical vectors (which may be sparse), e.g. as TF-IDF. Then you can apply a clustering algorithm to find groups of closely related lines (such as the example you gave above), and investigate their nature. You might need to tweak this a little, so it doesn't e.g. cluster on the server ID.
Both strategies have their ups and downs. The first one is quite fast; however, it will always just return some similar existing log lines, without quantifying how common this kind of line is. It's mostly useful for human inspection.
The second strategy is more computationally intensive, and depending on your parameters could fail completely (so maybe test it on a subset first), but could also give more useful results by actually building large groups of log entries that are very closely related.
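For the vectorization step, a bare-bones TF-IDF with cosine similarity might look like the sketch below. It is not a clustering algorithm by itself, just the sparse vectors and the similarity measure one would feed into it; the weighting uses the plain log(N/df) form and the third log line is made up.

```python
import math
from collections import Counter

def tfidf_vectors(lines):
    """Return one sparse {term: weight} dict per line."""
    docs = [Counter(line.split()) for line in lines]
    # Document frequency: in how many lines each term occurs.
    df = Counter(term for doc in docs for term in doc)
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in doc.items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

vectors = tfidf_vectors([
    "Change Transaction ABC123 Assigned To Server US91",
    "Change Transaction XYZ789 Assigned To Server GB47",
    "User login failed for account 42",
])
sim_xy = cosine(vectors[0], vectors[1])
sim_xz = cosine(vectors[0], vectors[2])
```

Note that the IDs (ABC123, US91, ...) get the highest weights here because they are rare, which is exactly the "don't cluster on the server ID" problem; masking or dropping such tokens before vectorizing is one possible tweak.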
It sounds like you could take the Lucene approach mentioned above, then use that as a source of input vectors for the machine learning library Mahout (http://mahout.apache.org/). Once there, you can train a classifier, or just use one of their clustering algorithms.
If your DBMS has it, take a look at SOUNDEX().

Storing Composite Patterns (Hierarchical Data) in Database

What are 'best-practices' for saving Composite patterns in a Relational Database?
We have been using Modified Preorder Tree Traversal. This is very quick to build the whole tree, but very slow to insert or delete new nodes (all left and right values need to be adjusted). Also querying the children of a node is not easy and very slow.
Another thing we noticed is that you really have to make sure the tree doesn't get messy. You need transaction locks, otherwise the left and right values can get corrupt, and fixing a corrupt left right tree is not an easy job.
Modified Preorder Tree Traversal does work very well, however, but I was wondering if there are better alternatives.
While finding all descendants of a row with MPTT is fast, finding all children can be slow. However, you should be able to fix that by adding a parent_id field to your table that records (yes, redundantly) the parent of the row. Then the search becomes:
SELECT *
FROM tbl
WHERE parent_id = z
Yes, parent_id contains redundant information, potentially denormalizing your table -- but since any insert/update/delete already requires global changes, keeping parent_id up-to-date isn't much extra to pay. You could alternatively use a level field that records the vertical level of the row, although that is in fact more likely to change under certain types of transformations (e.g. moving a subtree to a different point in the tree).
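To see both kinds of query side by side, here is a small self-contained sketch using Python's built-in sqlite3 module. The table layout (columns parent_id, left_pos, right_pos) and the five-node tree are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tbl (
        id INTEGER PRIMARY KEY,
        name TEXT,
        parent_id INTEGER,
        left_pos INTEGER,
        right_pos INTEGER
    )
""")
# Hypothetical tree: root -> (A -> (A1, A2), B)
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (1, "root", None, 1, 10),
        (2, "A", 1, 2, 7),
        (3, "B", 1, 8, 9),
        (4, "A1", 2, 3, 4),
        (5, "A2", 2, 5, 6),
    ],
)

def descendants(node_id):
    """MPTT query: everything nested inside the node's left/right interval."""
    rows = conn.execute(
        """
        SELECT d.name FROM tbl AS d
        JOIN tbl AS p ON d.left_pos > p.left_pos AND d.right_pos < p.right_pos
        WHERE p.id = ?
        ORDER BY d.left_pos
        """,
        (node_id,),
    )
    return [name for (name,) in rows]

def children(node_id):
    """Direct children only, via the redundant parent_id column."""
    rows = conn.execute(
        "SELECT name FROM tbl WHERE parent_id = ? ORDER BY left_pos", (node_id,)
    )
    return [name for (name,) in rows]
```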
The plain old link-to-parent representation (i.e. just having parent_id and no left_pos or right_pos), is of course faster for insert/update-heavy workloads, but the only queries it can answer efficiently are "Find the parent of X" and "Find the children of X." Most workloads involve much more reading than writing, so usually MPTT is faster overall -- but perhaps in your case you need to consider moving ("back") to link-to-parent?
The best way I have heard of to store hierarchical data in a database is to use a string attribute whose content is the list of parents, separated by, say, colons.
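A tiny illustration of that idea, assuming each path ends with the node's own ID (that detail and the sample tree are my assumptions). In SQL the descendant test would typically be a prefix match on the path column.

```python
# Hypothetical rows: name -> colon-separated path of IDs from the root.
paths = {
    "root": "1",
    "A": "1:2",
    "B": "1:3",
    "A1": "1:2:4",
    "A2": "1:2:5",
}

def descendants_of(path):
    """All nodes whose path starts with the given path plus a separator."""
    prefix = path + ":"
    return sorted(name for name, p in paths.items() if p.startswith(prefix))

def ancestor_ids(path):
    """IDs of all ancestors, root first."""
    return path.split(":")[:-1]
```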

Best data structure in C for these two situations?

I kinda need to decide on this to see if I can achieve it in a couple of hours before the deadline for my school project is due but I don't understand much about data structures and I need suggestions...
There's 2 things I need to do, they will probably use different data structures.
1. I need a data structure to hold profile records. The profiles must be searchable by name and social security number. The SSN is unique, so I can probably use that to my advantage? I suppose a hash map is the best bet here? But how do I use the SSN in a hash map to my advantage when looking for a specific profile? A basic and easy to understand explanation would be much appreciated.
2. I need a data structure to hold records about cities. I need to know which are cities with most visitors, cities less visited and the clients (the profile is pulled from the data structure in #1 for data about the clients) that visit a specific city.
This is the third data structure I need for my project and it's the data structure that I have no idea where to begin. Suggestions as for which type of data structure to use are appreciated, if possible, with examples on how to hold the data above in bold.
As a note:
The first data structure is already done (I talked about it in a previous question). The second one is posted here on #1 and although the other group members are taking care of that I just need to know if what we are trying to do is the "best" approach. The third one is #2, the one I need most help.
The right answer lies anywhere between a balanced search tree and an array.
The situation you have mentioned here and else-thread misses out on a very important point: The size of the data you are handling. You choose your data structure and algorithm(s) depending on the amount of data you have to handle. It is important that you are able to justify your choice(s). Using a less efficient general algorithm is not always bad. Being able to back up your choices (e.g: choosing bubble-sort since data size < 10 always) shows a) greater command of the field and b) pragmatism -- both of which are in short supply.
For searchability across multiple keys, store the data in any convenient form, and provide fast lookup indexes on the key(s).
This could be as simple as keeping the data in an array (or linked list, or ...) in the order of creation, and keeping a bunch of {hashtables|sorted arrays|btrees} of maps (key, data*) for all the interesting keys (SSN, name, ...).
If you had more time, you could even work out how to not have a different struct for each different map...
I think this solution probably applies to both your problems.
Good luck.
For clarity:
First we have a simple array of student records
    typedef struct student_s {
        char ssn[10];     // nul terminated so we can use str* functions
        char name[100];
        float GPA;
        ...
    } student;
    student slist[MAX_STUDENTS];
which is filled in as you go. It has no order, so search on any key is a linear time operation. Not a problem for 1,000 entries, but maybe a problem for 10,000, and certainly a problem for 1 million. See dirkgently's comments.
If we want to be able to search fast we need another layer of structure. I build a map between a key and the main data structure like this:
    typedef struct str_map {
        char *key;        // points at the key field inside the record
        student *data;    // points at the record in slist
    } smap;
    smap skey[MAX_STUDENTS];
and maintain skey sorted on the key, so that I can do fast lookups. (Only an array is a hassle to keep sorted, so we probably prefer a tree, or a hashmap.)
This complexity isn't needed (and should certainly be avoided) if you will only want fast searches on a single field.
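The same layering, sketched in Python rather than C to keep it short: one main list in creation order, a hash index on SSN and a sorted index on name. All names and sample values are invented for the example.

```python
from bisect import bisect_left, insort

records = []    # main storage, in order of creation
by_ssn = {}     # hash index: ssn -> position in records
by_name = []    # sorted index: (name, position) pairs

def add_student(ssn, name, gpa):
    pos = len(records)
    records.append({"ssn": ssn, "name": name, "gpa": gpa})
    by_ssn[ssn] = pos
    insort(by_name, (name, pos))   # keeps the name index sorted

def find_by_ssn(ssn):
    pos = by_ssn.get(ssn)
    return records[pos] if pos is not None else None

def find_by_name(name):
    # Binary search for the first entry with this name, then walk forward.
    i = bisect_left(by_name, (name, -1))
    found = []
    while i < len(by_name) and by_name[i][0] == name:
        found.append(records[by_name[i][1]])
        i += 1
    return found

add_student("000000001", "Bob", 3.1)
add_student("000000002", "Alice", 3.9)
add_student("000000003", "Bob", 2.7)
```

The indexes hold only positions into the main list, so the records themselves are never copied.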
Outside of a homework question, you'd use a relational database for
this. But that probably doesn't help you…
The first thing you need to figure out, as others have already pointed
out, is how much data you're handling. An O(n) brute-force search is
plenty fast as long as n is small. Since a trivial amount of data would
make this a trivial problem (put it in an array, and just brute-force
search it), I'm going to assume the amount of data is large.
Storing Cities
First, your search requirements appear to require the data sorted in
multiple ways:
1. Some city unique identifier (name?)
2. Number of visitors
This actually isn't too hard to satisfy. (1) is easiest. Store the
cities in some array. The array index becomes the unique identifier
(assumption: we aren't deleting cities, or if we do delete cities we can
just leave that array spot unused, wasting some memory. Adding is OK).
Now, we also need to be able to find most & fewest visits. Assuming
modifications may happen (e.g., adding cities, changing number of
visitors, etc.) and borrowing from relational databases, I'd suggest
creating an index using some form of balanced tree. Databases would
commonly use a B-Tree, but different ones may work for you: check Wikipedia's
article on trees. In each tree node, I'd just keep a pointer (or
array index) of the city data. No reason to make another copy!
I recommend a tree over a hash for one simple reason: you can very
easily do an in-order or reverse in-order traversal to find the bottom
or top N items. A hash can't do that.
Of course, if modifications may not happen, just use another array (of
pointers to the items, once again, don't duplicate them).
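Here is one way the visit-count index could look, sketched in Python. A sorted list maintained with `bisect` stands in for the balanced tree (insertion into it is linear rather than logarithmic), and the city names and counts are made up.

```python
from bisect import bisect_left, insort

cities = []       # city data; the list position is the city's identifier
by_visits = []    # sorted index of (visitors, city id) pairs, no copies of the data

def add_city(name, visitors=0):
    cities.append({"name": name, "visitors": visitors})
    city_id = len(cities) - 1
    insort(by_visits, (visitors, city_id))
    return city_id

def set_visitors(city_id, visitors):
    # Remove the old index entry, update the record, re-insert in order.
    old = (cities[city_id]["visitors"], city_id)
    del by_visits[bisect_left(by_visits, old)]
    cities[city_id]["visitors"] = visitors
    insort(by_visits, (visitors, city_id))

def most_visited(n):
    """Top n cities (n >= 1), highest visitor count first."""
    return [cities[cid]["name"] for _, cid in reversed(by_visits[-n:])]

def least_visited(n):
    """Bottom n cities, lowest visitor count first."""
    return [cities[cid]["name"] for _, cid in by_visits[:n]]

add_city("Phoenix", 500)
add_city("Vegas", 300)
boise = add_city("Boise", 50)
add_city("Austin", 400)
```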
Linking Cities to Profiles
How to do this depends on how you have to query the data, and what form
it can take. The most general is that each profile can be associated
with multiple cities and each city can be associated with multiple
profiles. Further, we want to be able to efficiently query from either
direction — ask both "who visits Phoenix?" and "which cities does Bob
visit?".
Shamelessly lifting from databases again, I'd create another data
structure, a fairly simple one along the lines of:
    struct profile_city {
        /* btree pointers here */
        size_t profile_idx; /* or use a pointer */
        size_t city_idx;    /* for both indices */
    };
So, to say Bob (profile 4) has visited Phoenix (city 2) you'd have
profile_idx = 4 and city_idx = 2. To say Bob has visited Vegas (city
1) as well, you'd add another one, so you'd have two of them for Bob.
Now, you have a choice: you can store these either in a tree or a
hash. Personally, I'd go with the tree, since that code is already
written. But a hash would be O(1) on average instead of O(log n) for
lookups.
Also, just like we did for the city visit count, create an index for
city_idx so the lookup can be done from that side too.
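The two-sided lookup can be sketched as follows, here with hash-based indexes (Python dicts of sets) instead of trees. Profile 7 is a made-up extra visitor; the other values come from the example above.

```python
from collections import defaultdict

# Two indexes over the same (profile_idx, city_idx) links, one per direction.
cities_by_profile = defaultdict(set)
profiles_by_city = defaultdict(set)

def add_visit(profile_idx, city_idx):
    cities_by_profile[profile_idx].add(city_idx)
    profiles_by_city[city_idx].add(profile_idx)

add_visit(4, 2)   # Bob (profile 4) visited Phoenix (city 2)
add_visit(4, 1)   # Bob also visited Vegas (city 1)
add_visit(7, 2)   # hypothetical profile 7 visited Phoenix

bobs_cities = sorted(cities_by_profile[4])
phoenix_visitors = sorted(profiles_by_city[2])
```

Using sets also takes care of the "grab only unique items" step when the same visit is recorded twice.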
Conclusion
You now have a way to look up the 5 most-visited cities (via an in-order
traversal on the city visit count index), and find out who visits those
cities, by searching for each city in the city_idx index to get the
profile_idx. Grab only unique items, and you have your answer.
Oh, and something seems wrong here: This seems like an awful lot of code for your instructor to want written in several hours!
