Storing Composite Patterns (Hierarchical Data) in a Database

What are the best practices for storing Composite patterns in a relational database?
We have been using Modified Preorder Tree Traversal (MPTT). Reading the whole tree is very quick, but inserting or deleting nodes is very slow (all left and right values need to be adjusted). Querying the direct children of a node is also awkward and very slow.
Another thing we noticed is that you really have to make sure the tree doesn't get messy. You need transaction locks; otherwise the left and right values can become corrupt, and repairing a corrupt left/right tree is not an easy job.
MPTT does work very well, but I was wondering whether there are better alternatives.

While finding all descendants of a row with MPTT is fast, finding all children can be slow. However, you should be able to fix that by adding a parent_id field to your table that records (yes, redundantly) the parent of the row. Then the search becomes:
SELECT *
FROM tbl
WHERE parent_id = z
Yes, parent_id contains redundant information, potentially denormalizing your table -- but since any insert/update/delete already requires global changes, keeping parent_id up-to-date isn't much extra to pay. You could alternatively use a level field that records the vertical level of the row, although that is in fact more likely to change under certain types of transformations (e.g. moving a subtree to a different point in the tree).
The plain old link-to-parent representation (i.e. just having parent_id and no left_pos or right_pos), is of course faster for insert/update-heavy workloads, but the only queries it can answer efficiently are "Find the parent of X" and "Find the children of X." Most workloads involve much more reading than writing, so usually MPTT is faster overall -- but perhaps in your case you need to consider moving ("back") to link-to-parent?
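To make the combination concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name tbl and the columns parent_id, left_pos and right_pos follow the answer above; the sample rows and the helper names descendants and children are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (id INTEGER PRIMARY KEY, name TEXT, "
    "parent_id INTEGER, left_pos INTEGER, right_pos INTEGER)"
)
# Hypothetical sample tree: root -> a -> a1, and root -> b.
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?, ?)",
    [
        (1, "root", None, 1, 8),
        (2, "a", 1, 2, 5),
        (3, "a1", 2, 3, 4),
        (4, "b", 1, 6, 7),
    ],
)


def descendants(node_id):
    """All descendants, found through the MPTT interval of the node."""
    rows = conn.execute(
        "SELECT c.name FROM tbl AS c JOIN tbl AS p "
        "ON c.left_pos > p.left_pos AND c.right_pos < p.right_pos "
        "WHERE p.id = ? ORDER BY c.left_pos",
        (node_id,),
    )
    return [r[0] for r in rows]


def children(node_id):
    """Direct children only, found through the redundant parent_id column."""
    rows = conn.execute(
        "SELECT name FROM tbl WHERE parent_id = ? ORDER BY left_pos",
        (node_id,),
    )
    return [r[0] for r in rows]
```

With these rows, descendants(1) returns ['a', 'a1', 'b'] and children(1) returns ['a', 'b'].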

The best way I have heard of to store hierarchical data in a database is to use a string attribute whose content is the list of parents separated by, say, colons.
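A rough sketch of that idea, again with Python's sqlite3 module. The table node, the column ancestors and both helper functions are hypothetical names, and the trailing colon after each parent id is an assumption added here so that a prefix match on 2 does not also match 20.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT, ancestors TEXT)"
)


def add_node(node_id, name, parent_id=None):
    """Store the node together with the colon-separated list of its parents."""
    if parent_id is None:
        ancestors = ""
    else:
        (parent_ancestors,) = conn.execute(
            "SELECT ancestors FROM node WHERE id = ?", (parent_id,)
        ).fetchone()
        ancestors = f"{parent_ancestors}{parent_id}:"
    conn.execute("INSERT INTO node VALUES (?, ?, ?)", (node_id, name, ancestors))


def descendants(node_id):
    """Every node whose parent list starts with this node's own full path."""
    (ancestors,) = conn.execute(
        "SELECT ancestors FROM node WHERE id = ?", (node_id,)
    ).fetchone()
    prefix = f"{ancestors}{node_id}:"
    rows = conn.execute(
        "SELECT name FROM node WHERE ancestors LIKE ? ORDER BY id",
        (prefix + "%",),
    )
    return [r[0] for r in rows]


add_node(1, "root")
add_node(2, "a", 1)
add_node(20, "b", 1)
add_node(3, "a1", 2)
```

Here descendants(2) returns only ['a1'], because node 20 has the parent list '1:' and not '1:2:'.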

Related

Why are there two leaf nodes in this B-Tree lookup?

In this graphic, we are looking up employee_id 123 and subsidary_id 20 in a B-Tree (from a tutorial on database indexes). There are two leaf nodes branching off the tree. Is this purely for demonstration, or is there something I'm missing? I would think that the only leaf node that needs to be checked is the top one, as it has employee_id max 123 and subsidary_id max 27.
The diagram itself isn't showing the actions of that particular search, rather it's showing a localised portion of the tree, so I wouldn't read too much into it for the specific query.
You're absolutely correct in that searching for the key 123-20, you never need to follow the link to the second leaf node (either through the hierarchical link from the left, or the sequential link from above).
However (and this is hard to tell without seeing the source material), it's quite likely that this diagram may be used for something else as well.
The fact that it shows links between consecutive leaf nodes means that it would be quite easy to use the index lookup to locate a specific entry, then sequentially process them.
By that I mean a query like "give me every record with an employee ID of 123", or "give me all records with an employee ID between 123 and 456", or "give me all records in employeeID/subsidiaryID order".
All those queries would entail finding a specific record using the hierarchical path (though the final one may have a quicker path direct to the first record), then following the sequential path for subsequent records.
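As a toy model of that access pattern (not a real B-Tree), the sketch below uses a sorted Python list in place of the chained leaf nodes and a binary search in place of the tree descent. The key values are invented, and lookup_employee is a hypothetical helper.

```python
from bisect import bisect_left

# Sorted (employee_id, subsidiary_id) keys standing in for the leaf chain.
index = [
    (121, 25), (122, 11), (123, 20), (123, 21),
    (123, 27), (124, 10), (125, 20),
]


def lookup_employee(employee_id):
    """Find the first matching key, then read forward in key order."""
    # The binary search plays the role of the descent through the tree.
    pos = bisect_left(index, (employee_id,))
    matches = []
    # The forward walk plays the role of following the leaf-to-leaf links.
    while pos < len(index) and index[pos][0] == employee_id:
        matches.append(index[pos])
        pos += 1
    return matches
```

lookup_employee(123) returns the three consecutive entries for employee 123, which in a real index could easily span more than one leaf node.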
In addition, the fact that the subsidiary IDs of 20 are both red means that it would be an ideal opportunity to educate the reader on the fact that the employee-subsidiary index is not necessarily the best one for all queries. In other words, an efficient query "give me all records from subsidiary 20" would be better with another index (one containing simply the subsidiary ID).
That'd be my best guess; it would be worthwhile looking at the tutorial to see if that diagram is used for something else.
Of course, it could be that the person putting together the tutorial couldn't be bothered creating a new graphic so simply used one from a different question, or an earlier iteration of the tutorial :-) I've been guilty of that before.

Maintain versions of a tree

I have a tree data structure with up to 1000 nodes at a particular level (max depth of 8-9 levels).
I need to maintain versions of this entire tree. A version is created after some processing happens. Between these versions, the data in the nodes may change (not more than 100 or so).
So far I have been cloning the entire tree for each new version, but the space consumption is huge after a few versions. I cannot delete the previous versions' records entirely, since I need to keep track of the changes.
What's the optimal way to store these versions in the database? (If not the database, is there an alternative?)
This is not a very straightforward problem, but it is a solved problem. In general, data structures that remember their history are called persistent data structures.
The linked Wikipedia page has an example of a persistent tree that you should look at.
The path copying approach is fairly simple to implement but doesn't have performance as good as is possible.
Keep each unique node permanently frozen in one table (once you insert a node in there, never edit or delete it). If you need to change a node even slightly, insert this modified node in your table. Then, keep track of your tree versions with foreign keys to the node table. That should require a trivial amount of space per tree.
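An in-memory sketch of the frozen-node idea combined with path copying, in plain Python. The Node class and the set_data helper are hypothetical, and a database version would replace object references with foreign keys.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Node:
    """A node that is never modified after it has been created."""
    data: str
    children: Tuple["Node", ...] = ()


def set_data(root, path, new_data):
    """Return a new root in which the node at `path` carries `new_data`.

    `path` is a sequence of child indexes starting at the root. Only the
    nodes along that path are copied; all other subtrees are shared with
    the previous version.
    """
    if not path:
        return replace(root, data=new_data)
    i = path[0]
    new_child = set_data(root.children[i], path[1:], new_data)
    children = root.children[:i] + (new_child,) + root.children[i + 1:]
    return replace(root, children=children)


v1 = Node("root", (Node("a", (Node("a1"),)), Node("b")))
v2 = set_data(v1, (0, 0), "a1-edited")
versions = [v1, v2]  # each version is just a reference to its root
```

After the edit, v1 still holds the old data, and the untouched subtree b is the very same object in both versions, so each new version costs only as many nodes as the depth of the change.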
Store each "version" as a map between the nodes that are changed and the old/new values.
You can reconstruct any previous version by inverting the sequence of operations.
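One way to read this suggestion, sketched in plain Python. The names current, deltas, commit and restore are hypothetical, and the sketch only covers changes to node values, not added or removed nodes.

```python
# Latest version of the node data, keyed by a hypothetical node id.
current = {"n1": "alpha", "n2": "beta", "n3": "gamma"}
# One entry per version: node id -> (old value, new value).
deltas = []


def commit(changes):
    """Apply `changes` (node id -> new value) and record old/new pairs."""
    delta = {
        node_id: (current.get(node_id), new)
        for node_id, new in changes.items()
    }
    current.update(changes)
    deltas.append(delta)


def restore(version):
    """Rebuild the node data as of `version` (0 means before any commit)."""
    state = dict(current)
    # Undo the recorded changes, newest first, back to the wanted version.
    for delta in reversed(deltas[version:]):
        for node_id, (old, _new) in delta.items():
            state[node_id] = old
    return state


commit({"n2": "beta2"})
commit({"n2": "beta3", "n3": "gamma2"})
```

restore(1) gives the data after the first commit, and restore(0) gives the original values, while only the changed nodes are stored per version.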
A possible practical solution might be:
If you need to restore an old version:
- Serialize the new tree and the previous tree into XML.
- Diff the new XML against the previous XML, then serialize and store the difference in a database. (For a Java solution I found http://diffxml.sourceforge.net and XMLUnit, but it has to be checked whether they can compute the difference in a way that makes it easy to restore the previous XML from the new XML.)
- Each time an old version is needed, take the differences from the database one by one, from the most recent to the most remote, and apply them in that order to the current tree (serialized into XML) to get the XML form of the old version of the tree.
If you do not need to reconstruct the old version, simply use XMLUnit to compute the differences and store their serializations in the database.
Since you care about previous versions of the tree and space is your primary concern, and assuming that the trees are not entirely different from one version to the next, you can store only the difference between the trees. How to do that is entirely up to you:
- you can do an in/pre/post-order traversal of the tree (assuming it's binary) and come up with logic to get from the location of one difference to the next
- or use a linked list to store only the differences, plus some logic to rebuild the trees

Hierarchical SQL data (Recursive CTE vs HierarchyID vs closure table)

I have a set of hierarchical data being used in a SQL Server database. The data is stored with a guid as the primary key, and a parentGuid as a foreign key pointing to the object's immediate parent. I access the data most often through Entity Framework in a WebApi project. To make the situation a little more complex, I also need to manage permissions based on this hierarchy, such that a permission applied to a parent applies to all of its descendants. My question is this:
I have searched all over and cannot decide which would be best to handle this situation. I know I have the following options.
I can create recursive CTEs (Common Table Expressions, aka RCTEs) to handle the hierarchical data. This seems to be the simplest approach for normal access, but I'm worried it may be slow when used to determine permission levels for child objects.
I can create a hierarchyId data type field in the table and use SQL Server provided functions such as GetAncestor(), IsDescendantOf(), etc. This seems like it would make querying fairly easy, but seems to require a fairly complex insert/update trigger to keep the hierarchyId field correct through inserts and moves.
I can create a closure table, which would store all of the relationships in the table. I imagine it as such: a parent column and a child column, with each parent -> child relationship represented (i.e. 1->2, 2->3 would be represented in the database as 1-2, 1-3, 2-3). The downside is that this requires insert, update, and delete triggers, even though they are fairly simple, and this method generates a lot of records.
I have tried searching all over and can't find anything giving any advice between these three methods.
PS I am also open to any alternative solutions to this problem
I have used all three methods. It's mostly a question of taste.
I agree that hierarchy with parent-child relationships in the table is the simplest. Moving a subtree is simple and it's easy to code the recursive access with CTEs. Performance is only going to be an issue if you have very large tree structures and you are frequently accessing the hierarchical data. For the most part, recursive CTEs are very fast when you have the correct indexes on the table.
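For reference, a small runnable sketch of the recursive CTE approach. It uses SQLite through Python's sqlite3 module rather than SQL Server, so the syntax differs slightly (T-SQL does not use the RECURSIVE keyword). The table name item, the sample rows and the helper descendants are hypothetical, with short strings standing in for guids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    "guid TEXT PRIMARY KEY, "
    "parentGuid TEXT REFERENCES item(guid))"
)
# An index on the parent column keeps each recursive step cheap.
conn.execute("CREATE INDEX ix_item_parent ON item (parentGuid)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)",
    [("root", None), ("a", "root"), ("a1", "a"), ("b", "root")],
)

DESCENDANTS_SQL = """
WITH RECURSIVE subtree(guid, depth) AS (
    SELECT guid, 0 FROM item WHERE guid = ?
    UNION ALL
    SELECT i.guid, s.depth + 1
    FROM item AS i JOIN subtree AS s ON i.parentGuid = s.guid
)
SELECT guid, depth FROM subtree WHERE depth > 0 ORDER BY depth, guid
"""


def descendants(guid):
    """All descendants of `guid`, with their distance from it."""
    return conn.execute(DESCENDANTS_SQL, (guid,)).fetchall()
```

descendants('root') returns [('a', 1), ('b', 1), ('a1', 2)]. A permission check could join such a result against a permissions table, which is where the index on parentGuid matters most.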
The closure table is more like a supplement to the above. Finding all the descendants of a given node is lightning fast: you don't need the CTEs, just one extra join, so it's sweet. Yes, the number of records blows up, but I think it is no more than N-1 times the number of nodes for a tree of depth N (e.g. a ternary tree of depth 5 would require 1 + 3 + 9 + 27 + 81 = 121 connections when storing only the parent-child relationship vs. 1 + 3 + (9 * 2) + (27 * 3) + (81 * 4) = 427 for the closure table). In addition, the closure table records are so narrow (just 2 ints at a minimum) that they take up almost no space. Generating the list of records to insert into the closure table when a new record is inserted into the hierarchy takes a tiny bit of overhead.
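A minimal sketch of how those closure rows can be generated on insert, using SQLite through Python's sqlite3 module. The table closure, its columns and the two helpers are hypothetical names. It stores no self-referencing rows, which matches the 1-2, 1-3, 2-3 example in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE closure ("
    "ancestor INTEGER, descendant INTEGER, "
    "PRIMARY KEY (ancestor, descendant))"
)


def add_node(node_id, parent_id=None):
    """Link the new node to its parent and to every ancestor of the parent."""
    if parent_id is None:
        return  # a root has no ancestors, so there is nothing to record
    conn.execute(
        "INSERT INTO closure (ancestor, descendant) "
        "SELECT ancestor, ? FROM closure WHERE descendant = ? "
        "UNION ALL SELECT ?, ?",
        (node_id, parent_id, parent_id, node_id),
    )


def descendants(node_id):
    """A single lookup, with no recursion."""
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? ORDER BY descendant",
        (node_id,),
    )
    return [r[0] for r in rows]


add_node(1)
add_node(2, 1)
add_node(3, 2)
```

After these three inserts the table holds exactly the rows (1, 2), (1, 3) and (2, 3), and descendants(1) returns [2, 3].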
I personally like HierarchyId since it really combines the benefit of the above two, which is compact storage, and lightning fast access. Once you get it set up, it is easy to query and takes very little space. As you mentioned, it's a little tricky to move subtrees around, but it's manageable. Anyway, how often do you really move a subtree in a hierarchy? There are some links you can find that will suggest some methods, e.g.:
http://sqlblogcasts.com/blogs/simons/archive/2008/03/31/SQL-Server-2008---HierarchyId---How-do-you-move-nodes-subtrees-around.aspx
The main drawback I have found to hierarchyId is the learning curve. It's not as obvious how to work with it as the other two methods. I have worked with some very bright SQL developers who would frequently get snagged on it, so you end up with one or two resident experts who have to field questions from everyone else.

How to represent a tree like structure in a db

I'm starting a project and I'm in the designing phase: I.e., I haven't decided yet on which db framework I'm going to use. I'm going to have code that creates a "forest" like structure. That is, many trees, where each tree is a standard: nodes and edges. After the code creates these trees I want to save them in the db. (and then pull them out eventually)
The naive approach to representing the data in the db is a relational db with two tables: nodes and edges. That is, the nodes table will have a node id, node data, etc.. And the edges table will be a mapping of node id to node id.
Is there a better approach? Or given the (limited) assumptions I'm giving this is the best approach? How about if we add an assumption that the trees are relatively small - is it better to save the whole tree as a blob in the db? Which type of db should I use in that case? Please comment on speed/scalability.
Thanks
I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Be sure to use some sort of low-level coding for the entity being treed to prevent looping. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file you can loop through either of, say, two relationships between the two files: a parent relation and a child relation.
A level is the level at which an entity is found in a tree. A low-level code for the entity is the lowest level at which the entity is found in any tree anywhere. Check to make sure the low-level code of the entity you want to make a child is less than or equal to prevent a loop. After adding an entity as a child, it will become at least one level lower.
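The sketch below is not the low-level-code scheme itself. It is a simpler stand-in that walks the existing parent-child links before a new link is added, which prevents the same kind of loop. The names children_of, reachable and add_child are hypothetical.

```python
from collections import defaultdict

# Parent -> set of children; plays the role of the Entity-Xref file.
children_of = defaultdict(set)


def reachable(start, target):
    """True if `target` is `start` itself or lies somewhere below it."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == target:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(children_of[node])
    return False


def add_child(parent, child):
    """Add the link unless the parent already sits below the child."""
    if reachable(child, parent):
        raise ValueError(f"adding {child!r} under {parent!r} would create a loop")
    children_of[parent].add(child)
```

For example, after add_child('A', 'B') and add_child('B', 'C'), calling add_child('C', 'A') raises ValueError. The walk costs more than comparing two stored level codes, which is the trade-off the low-level code is meant to avoid.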

Tree structures in a nosql database

I'm developing an application for Google App Engine which uses BigTable for its datastore.
It's an application about writing a story collaboratively. It's a very simple hobby project that I'm working on just for fun. It's open source and you can see it here: http://story.multifarce.com/
The idea is that anyone can write a paragraph, which then needs to be validated by two other people. A story can also be branched at any paragraph, so that another version of the story can continue in another direction.
Imagine the following tree structure:
Every number would be a paragraph. I want to be able to select all the paragraphs in every unique story line. Basically, those unique story lines are (2, 7, 2); (2, 7, 6, 5); (2, 7, 6, 11) and (2, 5, 9, 4). Ignore that the node "2" appears twice, I just took a tree structure diagram from Wikipedia.
I also made a diagram of a proposed solution: https://docs.google.com/drawings/edit?id=1fdUISIjGVBvIKMSCjtE4xFNZxiE08AoqvJSLQbxN6pc&hl=en
How can I set up a structure that is performance-efficient both for writing and, most importantly, for reading?
There are a number of well-known ways to represent trees in databases; each of them has its pros and cons. Here are the most common:
Adjacency list, where each node stores the ID of its parent.
Materialized path, which is the strategy Keyur describes. This is also the approach used by entity groups (eg, parent entities) in App Engine. It's also more or less what you're describing in your update.
Nested sets, where each node has 'left' and 'right' IDs, such that all child nodes are contained in that range.
Adjacency lists augmented with a root ID.
Each of these has its own advantages and disadvantages. Adjacency lists are simple, and cheap to update, but require multiple queries to retrieve a subtree (one for each parent node). Augmented adjacency lists make it possible to retrieve an entire tree by storing the ID of the root node in every record.
Materialized paths are easy to implement and cheap to update, and permit querying arbitrary subtrees, but impose increasing overhead for deep trees.
Nested sets are tougher to implement, and require updating, on average, half the nodes each time you make an insertion. They allow you to query arbitrary subtrees, without the increasing key length issue materialized path has.
In your specific case, though, it seems like you don't actually need a tree structure at all: each story, branched off an original though it may be, stands alone. What I would suggest is having a 'Story' model, which contains a list of keys of its paragraphs (Eg, in Python a db.ListProperty(db.Key)). To render a story, you fetch the Story, then do a batch fetch for all the Paragraphs. To branch a story, simply duplicate the story entry - leaving the references to Paragraphs unchanged.
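A plain-Python sketch of that suggestion, leaving the App Engine datastore API out entirely. The paragraphs dictionary stands in for the Paragraph entities, and the Story class, its branch method and the render function are hypothetical. Cutting the copied list at the branch point is an assumption about how branching at a paragraph would work.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Stand-in for the stored Paragraph entities, keyed by a made-up key.
paragraphs: Dict[str, str] = {
    "p1": "Once upon a time",
    "p2": "a story branched.",
    "p3": "One ending.",
    "p4": "Another ending.",
}


@dataclass
class Story:
    """A story is just an ordered list of paragraph keys."""
    paragraph_keys: List[str] = field(default_factory=list)

    def branch(self, at_index):
        """Copy the key list up to and including `at_index`.

        The paragraphs themselves are not copied, only the references.
        """
        return Story(list(self.paragraph_keys[: at_index + 1]))


def render(story):
    """Fetch all paragraphs of the story in order (a batch fetch in a real store)."""
    return [paragraphs[key] for key in story.paragraph_keys]


original = Story(["p1", "p2", "p3"])
alternative = original.branch(1)
alternative.paragraph_keys.append("p4")
```

Both stories share p1 and p2, and rendering either one is a single lookup of the Story followed by one batch fetch.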
One solution I can think about is - the path to the node is also the key of that node. So the key of node 11 is "2/7/6/11". You can traverse the path by a simple key lookup of all keys in the path - "2/7/6/11", "2/7/6", "2/7", "2"
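The key lookups described above can be derived from the path with a few lines of Python. The function name ancestor_keys is hypothetical.

```python
def ancestor_keys(key, sep="/"):
    """Return the key itself followed by the key of each ancestor, nearest first."""
    parts = key.split(sep)
    return [sep.join(parts[:i]) for i in range(len(parts), 0, -1)]


def story_line(key, sep="/"):
    """Return the same keys in reading order, from the root paragraph down."""
    return list(reversed(ancestor_keys(key, sep)))
```

ancestor_keys('2/7/6/11') returns ['2/7/6/11', '2/7/6', '2/7', '2'], so a whole story line can be fetched with one batch get on those keys.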

Resources