Hierarchical SQL data (Recursive CTE vs HierarchyID vs closure table)

I have a set of hierarchical data in a SQL Server database. Each row has a guid as its primary key and a parentGuid foreign key pointing to the object's immediate parent. I access the data most often through Entity Framework in a WebApi project. To make things a little more complex, I also need to manage permissions based on this hierarchy, such that a permission applied to a parent applies to all of its descendants.
My question is this: I have searched all over and cannot decide which approach would best handle this situation. I know I have the following options.
I can use recursive CTEs (common table expressions, aka RCTEs) to handle the hierarchical data. This seems to be the simplest approach for normal access, but I'm worried it may be slow when used to determine permission levels for child objects.
I can create a hierarchyId column in the table and use the methods SQL Server provides, such as GetAncestor() and IsDescendantOf(). This seems like it would make querying fairly easy, but it seems to require a fairly complex insert/update trigger to keep the hierarchyId field correct through inserts and moves.
I can create a closure table, which would store all of the relationships in the table. I imagine it as a parent column and a child column, with each parent -> child relationship represented (i.e. 1->2 and 2->3 would be represented in the database as 1-2, 1-3, 2-3). The downside is that this requires insert, update, and delete triggers, even though they are fairly simple, and this method generates a lot of records.
I have tried searching all over and can't find anything giving any advice between these three methods.
PS: I am also open to any alternative solutions to this problem.

I have used all three methods. It's mostly a question of taste.
I agree that hierarchy with parent-child relationships in the table is the simplest. Moving a subtree is simple and it's easy to code the recursive access with CTEs. Performance is only going to be an issue if you have very large tree structures and you are frequently accessing the hierarchical data. For the most part, recursive CTEs are very fast when you have the correct indexes on the table.
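The recursive access described above can be sketched as follows. This is a minimal illustration that uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere; the node table, the integer keys (the question uses guids), the index name, and the sample rows are all assumptions. In T-SQL the CTE would be written WITH subtree AS (...), without the RECURSIVE keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),
    name      TEXT NOT NULL
);
-- Index on the parent column so each recursion step can seek by parent.
CREATE INDEX ix_node_parent ON node(parent_id);

-- Hypothetical sample hierarchy.
INSERT INTO node (id, parent_id, name) VALUES
    (1, NULL, 'root'),
    (2, 1,    'docs'),
    (3, 1,    'src'),
    (4, 2,    'guide'),
    (5, 4,    'chapter1');
""")


def descendants(conn, node_id):
    """Return the ids of node_id and every node below it."""
    rows = conn.execute(
        """
        WITH RECURSIVE subtree(id) AS (
            SELECT id FROM node WHERE id = ?          -- anchor: the start node
            UNION ALL
            SELECT n.id                               -- step: children of rows found so far
            FROM node AS n
            JOIN subtree AS s ON n.parent_id = s.id
        )
        SELECT id FROM subtree ORDER BY id
        """,
        (node_id,),
    ).fetchall()
    return [r[0] for r in rows]


docs_subtree = descendants(conn, 2)
print(docs_subtree)  # [2, 4, 5]
```

A permission granted on node 2 would then apply to every id in that list.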
The closure table is more like a supplement to the above. Finding all the descendants of a given node is lightning fast: you don't need CTEs, just one extra join. Yes, the number of records blows up, but I think it is no more than N-1 times the number of nodes for a tree of depth N (e.g. a ternary tree of depth 5 would require 1 + 3 + 9 + 27 + 81 = 121 connections when storing only the parent-child relationship vs. 1 + 3 + (9 * 2) + (27 * 3) + (81 * 4) = 427 for the closure table). In addition, the closure table records are so narrow (just 2 ints at a minimum) that they take up almost no space. Generating the list of records to insert into the closure table when a new record is inserted into the hierarchy adds only a tiny bit of overhead.
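A rough sketch of the closure-table bookkeeping, again using Python's sqlite3 module in place of SQL Server. The table and column names are assumptions, and the helper does in application code what the question proposes doing in an insert trigger. It stores only ancestor -> descendant pairs, matching the 1-2, 1-3, 2-3 example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER,
    name      TEXT NOT NULL
);
CREATE TABLE closure (
    ancestor   INTEGER NOT NULL,
    descendant INTEGER NOT NULL,
    PRIMARY KEY (ancestor, descendant)
);
""")


def add_node(conn, node_id, parent_id, name):
    """Insert a node and the closure rows that link it to all of its ancestors."""
    conn.execute(
        "INSERT INTO node (id, parent_id, name) VALUES (?, ?, ?)",
        (node_id, parent_id, name),
    )
    if parent_id is None:
        return  # a root has no ancestors, so no closure rows
    # Every ancestor of the parent is also an ancestor of the new node ...
    conn.execute(
        "INSERT INTO closure (ancestor, descendant) "
        "SELECT ancestor, ? FROM closure WHERE descendant = ?",
        (node_id, parent_id),
    )
    # ... and so is the parent itself.
    conn.execute(
        "INSERT INTO closure (ancestor, descendant) VALUES (?, ?)",
        (parent_id, node_id),
    )


def descendant_names(conn, node_id):
    """All descendants with a single join, no recursion."""
    rows = conn.execute(
        "SELECT n.name FROM closure AS c "
        "JOIN node AS n ON n.id = c.descendant "
        "WHERE c.ancestor = ? ORDER BY n.id",
        (node_id,),
    ).fetchall()
    return [r[0] for r in rows]


add_node(conn, 1, None, "a")
add_node(conn, 2, 1, "b")
add_node(conn, 3, 2, "c")

pairs = conn.execute(
    "SELECT ancestor, descendant FROM closure ORDER BY ancestor, descendant"
).fetchall()
print(pairs)                      # [(1, 2), (1, 3), (2, 3)]
print(descendant_names(conn, 1))  # ['b', 'c']
```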
I personally like HierarchyId since it really combines the benefits of the above two: compact storage and lightning-fast access. Once you get it set up, it is easy to query and takes very little space. As you mentioned, it's a little tricky to move subtrees around, but it's manageable. Anyway, how often do you really move a subtree in a hierarchy? There are some links you can find that suggest methods for doing so, e.g.:
http://sqlblogcasts.com/blogs/simons/archive/2008/03/31/SQL-Server-2008---HierarchyId---How-do-you-move-nodes-subtrees-around.aspx
The main drawback I have found to hierarchyId is the learning curve. It's not as obvious how to work with it as the other two methods. I have worked with some very bright SQL developers who would frequently get snagged on it, so you end up with one or two resident experts who have to field questions from everyone else.

Related

How to design database tables for hierarchical data with unknown depth?

I want to create hierarchical data with unknown depth like this:
Create categories and sub categories and for those subcategories they will have also subcategories and so on.
The depth of the subcategories is unknown and only will be done in the runtime by the user.
What I thought about is to put them all in one table, with a parent column holding the ID of the parent category.
I don't know if this is the right way to do it, but I can't see any other way.
I did a quick search, but what I found is not directly related to DB table design.
I am using MS SQL Server 2012.

There are 3 common approaches to this & 1 not so common.
1. Adjacency lists (your approach)
Pro - easy to understand, fast inserts anywhere
Con - slow to query trees of unknown depth recursively
2. Nested sets
Pro - fast to query
Con - Inserts in middle of list are slow
3. Path - like hierarchyid (basically a binary path)
Pro - fast
Con - paths, like hierarchyid, usually have a limited length; I think hierarchyid is about 892 bytes max
4. Closure table
Pro - Best of nested sets & adjacency lists. Fast inserts & selects.
Con - A bit hard to get your head around at first but worth the effort if performance is an issue
Source: SQL Antipatterns - Bill Karwin

The most widely used design pattern for representing hierarchies in tables is called "Adjacency List". This is the pattern you've presented in the question.
One alternative is called "Nested Sets". Here is a description of Nested Sets in a nutshell: https://en.wikipedia.org/wiki/Nested_set_model
If you look up Adjacency List vs Nested Set, you'll get a lot of articles discussing the trade offs between the two.
Basically, Adjacency List is easy to update, but hard to work with, except for the most basic operations. Nested Set is hard to update, but easy to work with. Operations like finding the path from the root or finding the subtree are straightforward and well understood.
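To make the two Nested Set operations mentioned above concrete, here is a small sketch using Python's sqlite3 module. The category table, the lft/rgt column names, and the hand-numbered sample tree are assumptions for illustration; the SQL itself is plain enough that it should carry over to SQL Server largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    name TEXT PRIMARY KEY,
    lft  INTEGER NOT NULL,
    rgt  INTEGER NOT NULL
);
-- Hypothetical tree: root has children A and B; A has children A1 and A2.
-- Each node's (lft, rgt) interval encloses the intervals of its descendants.
INSERT INTO category (name, lft, rgt) VALUES
    ('root', 1, 10),
    ('A',    2, 7),
    ('A1',   3, 4),
    ('A2',   5, 6),
    ('B',    8, 9);
""")


def subtree(conn, name):
    """The node and all of its descendants: rows whose lft falls inside its interval."""
    rows = conn.execute(
        "SELECT child.name FROM category AS parent "
        "JOIN category AS child ON child.lft BETWEEN parent.lft AND parent.rgt "
        "WHERE parent.name = ? ORDER BY child.lft",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]


def path_from_root(conn, name):
    """The node and all of its ancestors: rows whose interval encloses the node."""
    rows = conn.execute(
        "SELECT anc.name FROM category AS node "
        "JOIN category AS anc ON node.lft BETWEEN anc.lft AND anc.rgt "
        "WHERE node.name = ? ORDER BY anc.lft",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]


print(subtree(conn, "A"))          # ['A', 'A1', 'A2']
print(path_from_root(conn, "A2"))  # ['root', 'A', 'A2']
```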

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by user.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES_n table until the max number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "increase performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...

I have seen this done before on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way, it is just marginally better than the "improvement" in my opinion, and removes the need to do large scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.

No, it's not. It's terrible.
"until the max number of column (handled by application) is reached, then a new table is created."
This sentence says it all. Under no circumstance should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be like this.
Consider this:
You lose all type-safety as you have to store all values in the column "PROPERTY_VALUE"
Depending on your users, you could have them change the schema beforehand and then let them run some kind of database update batch job, so at least all the properties would be declared in the right datatype. Also, you could lose the entity_id/key thing.
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This certainly reeks of it.
Maybe an RDBMS isn't the right thing for your app. Consider using a key/value or document store like MongoDB or another NoSQL database. (http://nosql-database.org/)

From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know how many max custom properties a user might have, I'd say you'd better set the table number of columns to that value.
Then again, I'm not an expert, but making new columns on the fly isn't the kind of operation databases like. It's gonna bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties, or stick with the old system.

I believe creating a new table for each entity to store properties is a bad design as you could end up bulking the database with tables. The only pro to applying the second method would be that you are not traversing through all of the redundant rows that do not apply to the Entity selected. However using indexes on your database on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.

There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from entity where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound where clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
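The difference in query shape for such a compound filter can be sketched like this, using Python's sqlite3 module. The property keys ('is_fruit', 'kind') and the sample rows are made up for illustration; only the ENTITY, ENTITY_PROPERTIES, and ENTITY_PROPERTIES_n table shapes come from the question. The sketch shows only that the row-based design needs one join per filtered property, and it makes no claim about which one is actually faster.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (entity_id INTEGER PRIMARY KEY);

-- Old design: one row per custom property.
CREATE TABLE entity_properties (
    entity_id      INTEGER NOT NULL,
    property_key   TEXT    NOT NULL,
    property_value TEXT,
    PRIMARY KEY (entity_id, property_key)
);

-- New design: one column per custom property.
CREATE TABLE entity_properties_1 (
    entity_id         INTEGER PRIMARY KEY,
    custom_property_1 TEXT,   -- hypothetical meaning: is_fruit
    custom_property_2 TEXT    -- hypothetical meaning: kind
);

INSERT INTO entity VALUES (1), (2), (3);
INSERT INTO entity_properties VALUES
    (1, 'is_fruit', 'true'),  (1, 'kind', 'banana'),
    (2, 'is_fruit', 'true'),  (2, 'kind', 'apple'),
    (3, 'is_fruit', 'false'), (3, 'kind', 'banana');
INSERT INTO entity_properties_1 VALUES
    (1, 'true',  'banana'),
    (2, 'true',  'apple'),
    (3, 'false', 'banana');
""")

# Row-based design: the property table is joined once per filtered property.
eav_ids = [r[0] for r in conn.execute("""
    SELECT e.entity_id
    FROM entity AS e
    JOIN entity_properties AS p1
      ON p1.entity_id = e.entity_id
     AND p1.property_key = 'is_fruit' AND p1.property_value = 'true'
    JOIN entity_properties AS p2
      ON p2.entity_id = e.entity_id
     AND p2.property_key = 'kind' AND p2.property_value = 'banana'
    ORDER BY e.entity_id
""")]

# Column-based design: a plain WHERE clause on one table.
wide_ids = [r[0] for r in conn.execute("""
    SELECT entity_id
    FROM entity_properties_1
    WHERE custom_property_1 = 'true' AND custom_property_2 = 'banana'
    ORDER BY entity_id
""")]

print(eav_ids, wide_ids)  # [1] [1]
```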
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property, and how many columns are too many. The likelihood of entertaining bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure that the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

Database design linked list vs. order by

I have a table which needs to persist some user actions in sequence. I can either save it using a self-referencing table, which would be like a linked list, or skip the self reference entirely and just use a timestamp to keep the sequence.
This table has references to other tables, such as the user and the files associated with an action.
The operations need to support CRUD. The frequency of operations is in this order: Retrieve > Insert > Update > Delete
What is your design preference and why?
Thanks!

i would avoid "linked lists" like the plague. about the only thing they are good for is retrieving the "next" item. the problem is that every extra "hop" requires a join, so if you want to parameterise over that (eg to provide a function that retrieves N items following a given item) then you need one of (1) machine-generated joins (2) multiple selects (3) SQL that's unlikely to be portable and/or supported by your ORM.
this is the same problem that makes trees notoriously nasty in sql. it's "fixed" by recursive joins, but that's (3) above (maybe i am old-fashioned and someone will say that these are well supported now - if so i guess i will learn...).
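For comparison, here is a small sketch of both ways to fetch the N actions following a given one, using Python's sqlite3 module. The action table, its columns, and the sample rows are assumptions. The timestamp version assumes timestamps are unique and sortable as stored. The linked-list version uses a recursive CTE, which is option (3) above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE action (
    id          INTEGER PRIMARY KEY,
    prev_id     INTEGER REFERENCES action(id),  -- linked-list pointer
    created_at  TEXT NOT NULL,                  -- ISO-8601 text sorts chronologically
    description TEXT NOT NULL
);
-- Ids are deliberately not in sequence order.
INSERT INTO action (id, prev_id, created_at, description) VALUES
    (10, NULL, '2024-01-01T09:00:00', 'login'),
    (30, 10,   '2024-01-01T09:05:00', 'upload'),
    (20, 30,   '2024-01-01T09:07:00', 'rename'),
    (40, 20,   '2024-01-01T09:09:00', 'logout');
""")


def next_n_by_timestamp(conn, action_id, n):
    """Order by the timestamp column; no self reference needed."""
    rows = conn.execute(
        "SELECT description FROM action "
        "WHERE created_at > (SELECT created_at FROM action WHERE id = ?) "
        "ORDER BY created_at LIMIT ?",
        (action_id, n),
    ).fetchall()
    return [r[0] for r in rows]


def next_n_by_links(conn, action_id, n):
    """Follow prev_id pointers forward; each hop is one recursion step."""
    rows = conn.execute(
        """
        WITH RECURSIVE chain(id, description, hops) AS (
            SELECT id, description, 0 FROM action WHERE id = ?
            UNION ALL
            SELECT a.id, a.description, c.hops + 1
            FROM action AS a
            JOIN chain AS c ON a.prev_id = c.id
            WHERE c.hops < ?
        )
        SELECT description FROM chain WHERE hops > 0 ORDER BY hops
        """,
        (action_id, n),
    ).fetchall()
    return [r[0] for r in rows]


print(next_n_by_timestamp(conn, 10, 2))  # ['upload', 'rename']
print(next_n_by_links(conn, 10, 2))      # ['upload', 'rename']
```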

How to represent a tree like structure in a db

I'm starting a project and I'm in the designing phase: I.e., I haven't decided yet on which db framework I'm going to use. I'm going to have code that creates a "forest" like structure. That is, many trees, where each tree is a standard: nodes and edges. After the code creates these trees I want to save them in the db. (and then pull them out eventually)
The naive approach to representing the data in the db is a relational db with two tables: nodes and edges. That is, the nodes table will have a node id, node data, etc., and the edges table will be a mapping of node id to node id.
Is there a better approach? Or given the (limited) assumptions I'm giving this is the best approach? How about if we add an assumption that the trees are relatively small - is it better to save the whole tree as a blob in the db? Which type of db should I use in that case? Please comment on speed/scalability.
Thanks

I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.

Be sure to use some sort of low-level coding for the entity being treed to prevent looping. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file you can loop through one of, say, two relationships between the two files: a parent relation and a child relation.
A level is the level at which an entity is found in a tree. A low-level code for the entity is the lowest level at which the entity is found in any tree anywhere. Check to make sure the low-level code of the entity you want to make a child is less than or equal to prevent a loop. After adding an entity as a child it will become at least one level lower.
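One way to enforce the same "no loops" rule is sketched below with Python's sqlite3 module. Note that this is a different technique from the low-level codes described above: instead of maintaining a level number per entity, it walks up from the proposed parent and rejects the link if the proposed child is already one of its ancestors. The entity and entity_xref tables and all names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE entity_xref (
    parent_id INTEGER NOT NULL REFERENCES entity(id),
    child_id  INTEGER NOT NULL REFERENCES entity(id),
    PRIMARY KEY (parent_id, child_id)
);
INSERT INTO entity (id, name) VALUES (1, 'assembly'), (2, 'bracket'), (3, 'bolt');
""")


def would_create_loop(conn, parent_id, child_id):
    """True if linking child_id under parent_id would close a cycle."""
    if parent_id == child_id:
        return True
    row = conn.execute(
        """
        WITH RECURSIVE ancestors(id) AS (
            SELECT parent_id FROM entity_xref WHERE child_id = ?
            UNION                      -- UNION (not UNION ALL) drops repeats
            SELECT x.parent_id
            FROM entity_xref AS x
            JOIN ancestors AS a ON x.child_id = a.id
        )
        SELECT 1 FROM ancestors WHERE id = ? LIMIT 1
        """,
        (parent_id, child_id),
    ).fetchone()
    return row is not None


def add_child(conn, parent_id, child_id):
    """Insert the parent -> child link unless it would create a loop."""
    if would_create_loop(conn, parent_id, child_id):
        raise ValueError(f"linking {child_id} under {parent_id} would create a loop")
    conn.execute(
        "INSERT INTO entity_xref (parent_id, child_id) VALUES (?, ?)",
        (parent_id, child_id),
    )


add_child(conn, 1, 2)  # assembly -> bracket
add_child(conn, 2, 3)  # bracket  -> bolt

print(would_create_loop(conn, 3, 1))  # True: assembly is already an ancestor of bolt
print(would_create_loop(conn, 1, 3))  # False: a second path down to bolt is not a loop
```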

Storing Composite Patterns (Hierarchical Data) in Database

What are 'best-practices' for saving Composite patterns in a Relational Database?
We have been using Modified Preorder Tree Traversal. This is very quick to build the whole tree, but very slow to insert or delete new nodes (all left and right values need to be adjusted). Also querying the children of a node is not easy and very slow.
Another thing we noticed is that you really have to make sure the tree doesn't get messy. You need transaction locks, otherwise the left and right values can get corrupt, and fixing a corrupt left right tree is not an easy job.
Modified Preorder Tree Traversal does work very well, however, but I was wondering if there are better alternatives.

While finding all descendants of a row with MPTT is fast, finding all children can be slow. However, you should be able to fix that by adding a parent_id field to your table that records (yes, redundantly) the parent of the row. Then the search becomes:
SELECT *
FROM tbl
WHERE parent_id = z
Yes, parent_id contains redundant information, potentially denormalizing your table -- but since any insert/update/delete already requires global changes, keeping parent_id up-to-date isn't much extra to pay. You could alternatively use a level field that records the vertical level of the row, although that is in fact more likely to change under certain types of transformations (e.g. moving a subtree to a different point in the tree).
The plain old link-to-parent representation (i.e. just having parent_id and no left_pos or right_pos), is of course faster for insert/update-heavy workloads, but the only queries it can answer efficiently are "Find the parent of X" and "Find the children of X." Most workloads involve much more reading than writing, so usually MPTT is faster overall -- but perhaps in your case you need to consider moving ("back") to link-to-parent?
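A minimal sketch of MPTT combined with the redundant parent_id, using Python's sqlite3 module. The tbl, parent_id, left_pos and right_pos names follow the text above; the id column, the sample tree, and the helper functions are assumptions. Inserting a leaf shifts every left/right value at or beyond the insertion point, and all of those writes run in one transaction so that a failure part-way through cannot leave the numbering half-updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER,
    left_pos  INTEGER NOT NULL,
    right_pos INTEGER NOT NULL
);
-- Hypothetical tree: node 1 is the root with children 2 and 3.
INSERT INTO tbl (id, parent_id, left_pos, right_pos) VALUES
    (1, NULL, 1, 6),
    (2, 1,    2, 3),
    (3, 1,    4, 5);
""")


def insert_last_child(conn, new_id, parent_id):
    """Add new_id as the last child of parent_id, renumbering in one transaction."""
    with conn:  # commits on success, rolls back if anything raises
        (parent_right,) = conn.execute(
            "SELECT right_pos FROM tbl WHERE id = ?", (parent_id,)
        ).fetchone()
        # Open a gap of two positions at the parent's right edge.
        conn.execute(
            "UPDATE tbl SET right_pos = right_pos + 2 WHERE right_pos >= ?",
            (parent_right,),
        )
        conn.execute(
            "UPDATE tbl SET left_pos = left_pos + 2 WHERE left_pos > ?",
            (parent_right,),
        )
        conn.execute(
            "INSERT INTO tbl (id, parent_id, left_pos, right_pos) VALUES (?, ?, ?, ?)",
            (new_id, parent_id, parent_right, parent_right + 1),
        )


def children(conn, node_id):
    """Direct children only, answered by the redundant parent_id column."""
    rows = conn.execute(
        "SELECT id FROM tbl WHERE parent_id = ? ORDER BY left_pos", (node_id,)
    ).fetchall()
    return [r[0] for r in rows]


def descendants(conn, node_id):
    """All descendants, answered by the left/right interval."""
    rows = conn.execute(
        "SELECT c.id FROM tbl AS p "
        "JOIN tbl AS c ON c.left_pos > p.left_pos AND c.right_pos < p.right_pos "
        "WHERE p.id = ? ORDER BY c.left_pos",
        (node_id,),
    ).fetchall()
    return [r[0] for r in rows]


insert_last_child(conn, 4, 2)  # node 4 becomes a child of node 2

print(children(conn, 1))     # [2, 3]
print(descendants(conn, 1))  # [2, 4, 3]
```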

The best way I have heard of to store hierarchical data in a database is to use a string attribute whose content is the list of parents separated by, say, colons.
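A rough sketch of this path-string idea, using Python's sqlite3 module. The category table and its sample rows are assumptions, as is the exact path format: here each path lists the ancestor ids followed by the node's own id, each terminated by a colon, so that a prefix match on '1:2:' cannot accidentally match '1:21:'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE category (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    path TEXT NOT NULL      -- ancestor ids plus own id, e.g. '1:2:3:'
);
INSERT INTO category (id, name, path) VALUES
    (1,  'root',    '1:'),
    (2,  'docs',    '1:2:'),
    (3,  'guide',   '1:2:3:'),
    (21, 'archive', '1:21:');
""")


def descendants(conn, node_id):
    """Names of all nodes below node_id, found with a prefix match on the path."""
    (path,) = conn.execute(
        "SELECT path FROM category WHERE id = ?", (node_id,)
    ).fetchone()
    rows = conn.execute(
        "SELECT name FROM category WHERE path LIKE ? AND id <> ? ORDER BY id",
        (path + "%", node_id),
    ).fetchall()
    return [r[0] for r in rows]


def ancestor_ids(conn, node_id):
    """Ids of all ancestors, root first, read straight out of the path string."""
    (path,) = conn.execute(
        "SELECT path FROM category WHERE id = ?", (node_id,)
    ).fetchone()
    ids = [int(part) for part in path.split(":") if part]
    return ids[:-1]  # drop the node's own id


print(descendants(conn, 2))   # ['guide']
print(ancestor_ids(conn, 3))  # [1, 2]
```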
