I want to create hierarchical data with unknown depth like this:
Categories have subcategories, those subcategories in turn have their own subcategories, and so on.
The depth of the subcategories is unknown and will only be determined at runtime by the user.
What I thought about is to put them all in one table and have a parent column holding the ID of the parent category, like this:
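Something along these lines (an illustrative sketch; the column names are just my assumption):

Id | Name        | ParentId
---------------------------
1  | Electronics | NULL
2  | Laptops     | 1
3  | Gaming      | 2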
I don't know if this is the right way to do it, but I can't see any other way.
I did a quick search, and what I found is not directly related to DB table design.
I am using MS SQL Server 2012
There are 3 common approaches to this and 1 not-so-common one.
1. Adjacency list (your approach; see the sketch after this list)
Pro - easy to understand, fast inserts anywhere
Con - slow to query trees of unknown depth recursively
2. Nested sets
Pro - fast to query
Con - Inserts in middle of list are slow
3. Path - like hierarchyid (basically a binary path)
Pro - fast
Con - paths like hierarchyid usually have a limited length - I think hierarchyid is about 892 bytes max
4. Closure table
Pro - Best of nested sets & adjacency lists. Fast inserts & selects.
Con - A bit hard to get your head around at first but worth the effort if performance is an issue
Source: SQL Antipatterns - Bill Karwin
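To make option 1 concrete, here is a minimal sketch of the adjacency-list design from the question, plus a recursive CTE to walk a subtree of unknown depth (works on SQL Server 2005 and later, so fine on 2012; table and column names are illustrative, not from the question):

CREATE TABLE Categories (
    CategoryId INT IDENTITY(1,1) PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL,
    ParentId   INT NULL REFERENCES Categories (CategoryId)  -- NULL marks a root category
);

DECLARE @RootId INT = 1;  -- the category whose subtree we want

WITH Subtree AS (
    -- anchor: the starting category
    SELECT CategoryId, Name, ParentId, 0 AS Depth
    FROM Categories
    WHERE CategoryId = @RootId
    UNION ALL
    -- recursive step: children of anything already in the result
    SELECT c.CategoryId, c.Name, c.ParentId, s.Depth + 1
    FROM Categories c
    JOIN Subtree s ON c.ParentId = s.CategoryId
)
SELECT CategoryId, Name, Depth
FROM Subtree
ORDER BY Depth, Name;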
The most widely used design pattern for representing hierarchies in tables is called "Adjacency List". This is the pattern you've presented in the question.
One alternative is called "Nested Sets". Here is a description of Nested Sets in a nutshell: https://en.wikipedia.org/wiki/Nested_set_model
If you look up Adjacency List vs Nested Set, you'll get a lot of articles discussing the trade offs between the two.
Basically, the Adjacency List is easy to update but hard to work with, except for the most basic operations. The Nested Set is hard to update but easy to work with: operations like finding the path from the root or finding a subtree are straightforward and well understood.
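For a flavor of the Nested Set side, here is a minimal sketch (schema and names are illustrative): each node stores a left and a right bound, and a node's descendants are exactly the nodes whose bounds fall inside its own.

CREATE TABLE CategoriesNested (
    CategoryId INT PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL,
    Lft        INT NOT NULL,  -- left bound assigned by a depth-first walk
    Rgt        INT NOT NULL   -- right bound; all descendants lie between Lft and Rgt
);

-- Entire subtree of a given category: a single range predicate, no recursion.
SELECT child.Name
FROM CategoriesNested AS parent
JOIN CategoriesNested AS child
  ON child.Lft > parent.Lft AND child.Lft < parent.Rgt
WHERE parent.Name = 'Electronics';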
Related
I've considered creating a Vertices table and an Edges table, but wouldn't building graphs in memory and traversing sub-graphs require a large number of lookups? I'd like to avoid excessive database reads. Is there any other way of persisting a graph?
Side note: I've heard of Neo4j, but my question is really how to conceptually represent a graph in a standard database. I am open to some NoSQL solutions like MongoDB, though.
The answer is, unfortunately: your consideration is completely right on every point. You have to store nodes (vertices) in one table and edges referencing a FromNode and a ToNode to convert a graph data structure to a relational data structure. And you are also right that this ends up in a large number of lookups, because you are not able to partition the graph into subgraphs that can be queried at once. You have to traverse from node to edge to node to edge... and so on (recursively, while SQL works with sets).
The point is...
Relational, graph-oriented, object-oriented, and document-based are different types of data structures that meet different requirements. That's what it's all about, and why so many different NoSQL databases (most of them simple document stores) came up: it simply makes no sense to organize big data in a relational way.
Alternative 1 - Graph oriented database
But there are also graph-oriented NoSQL databases that make the graph data model a first-class citizen, like OrientDB, which I am playing around with a little bit at the moment. The nice thing about it is that although it persists data as a graph, it can still be used in a relational or even object-oriented or document-oriented way (i.e., by querying with plain old SQL). Nevertheless, traversing the graph is surely the optimal way to get data out of it.
Alternative 2 - working with graphs in memory
When it comes to fast routing, routing frameworks like GraphHopper build up the complete graph (billions of nodes) in memory. Because GraphHopper uses a memory-mapped implementation of its GraphStore, this even works on Android devices with only a few MB of memory needed. The complete graph is read from the database into memory at startup, and routing is then done there, so there is no need to look up the database.
I faced this same issue and decided to finally go with the following structure, which requires 2 database queries, then the rest of the work is in memory:
Store nodes in a table and reference the graph with each node record:
Table Nodes
id | title | graph_id
---------------------
105 | node1 | 2
106 | node2 | 2
Also store edges in another table and again reference the graph these edges belong to with each edge:
Table Edges
id | from_node_id | to_node_id | graph_id
-----------------------------------------
1 | 105 | 106 | 2
2 | 106 | 105 | 2
Get all the nodes with one query, then get all the edges with another.
Now build your preferred way to store the graph (e.g., adjacency list) and proceed with your application flow.
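In DDL terms, the two tables and the two queries might look like this (column names are taken from the tables above; the types and constraints are my assumptions):

CREATE TABLE Nodes (
    id       INT PRIMARY KEY,
    title    NVARCHAR(100) NOT NULL,
    graph_id INT NOT NULL
);

CREATE TABLE Edges (
    id           INT PRIMARY KEY,
    from_node_id INT NOT NULL REFERENCES Nodes (id),
    to_node_id   INT NOT NULL REFERENCES Nodes (id),
    graph_id     INT NOT NULL
);

-- Query 1: all nodes of graph 2.  Query 2: all its edges.
SELECT id, title FROM Nodes WHERE graph_id = 2;
SELECT from_node_id, to_node_id FROM Edges WHERE graph_id = 2;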
Adding to the previous answers: MS SQL Server added support for graph architecture starting with SQL Server 2017.
It follows the described pattern of having node and edge tables (which must be created with the special "AS NODE" and "AS EDGE" keywords).
It also has the new MATCH keyword, introduced "to support pattern matching and traversal through the graph", used like this (friend is the name of the edge table in the example below):
SELECT Person2.name AS FriendName
FROM Person Person1, friend, Person Person2
WHERE MATCH(Person1-(friend)->Person2)
AND Person1.name = 'Alice';
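For completeness, the node and edge tables that query assumes could be declared roughly like this (the schema is illustrative; AS NODE / AS EDGE and the $node_id/$from_id/$to_id pseudo-columns are part of the SQL Server 2017 graph feature):

CREATE TABLE Person (
    ID   INTEGER PRIMARY KEY,
    name VARCHAR(100)
) AS NODE;

CREATE TABLE friend AS EDGE;  -- an edge table needs no user-defined columns

-- Connect two existing Person nodes with a friend edge:
INSERT INTO friend ($from_id, $to_id)
SELECT p1.$node_id, p2.$node_id
FROM Person AS p1, Person AS p2
WHERE p1.name = 'Alice' AND p2.name = 'John';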
There is also a really good set of articles on SQL Server Graph Databases on the Redgate Hub.
I am going to disagree with the other posts here. If you have a special class of graphs with restrictions, you can often get away with a more specialized design (for example, a limited number of edges per vertex, only needing to traverse one way, etc.).
However, for storing an arbitrary graph, relational databases are an excellent choice. They're designed with an incredibly good set of tradeoffs that perform well in almost all situations. In addition, data needs tend to change over time, and a relational database lets you painlessly change the storage and lookup without changing the data representation.
Let's review your design:
one table for vertices (id, data)
one table for edges (startId, endId, data)
First observe that the storage is efficient as it is proportional to the data to store. If we have 10 vertices and 10 edges, we store 20 pieces of information.
Now, let's look at lookup. Assuming we have an index on vertex id, we can look up any data we want in O(log n) time (maybe better, depending on the index).
Given a node tell me the edges leaving it
Given a node tell me the edges entering it
Given an edge tell me the node it came from or enters
That's all the basic queries you need.
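A sketch of that design with the indexes that back those three lookups (table and index names are illustrative):

CREATE TABLE Vertices (
    id   INT PRIMARY KEY,       -- lookup: vertex by id
    data NVARCHAR(MAX) NULL
);

CREATE TABLE Edges (
    startId INT NOT NULL REFERENCES Vertices (id),
    endId   INT NOT NULL REFERENCES Vertices (id),
    data    NVARCHAR(MAX) NULL
);

CREATE INDEX IX_Edges_Start ON Edges (startId);  -- edges leaving a given node
CREATE INDEX IX_Edges_End   ON Edges (endId);    -- edges entering a given node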
Now suppose you had a "graph database" that stores a list of edges leaving each vertex. This makes each vertex variable size. It's a little easier to traverse. But what if you want to traverse in the other direction? Now you have to store a list of edges entering each vertex as well.
Now you have two copies of that information, and the database (or you, the developer) must do a lot of work to make sure they never get out of sync.
O(log(n)) vs O(1)
Relational database indices typically store data in a sorted form, or as others have pointed out, can also use a hash table.
Even if you are stuck with sorted data, it's going to perform very well.
First, note that big-O measures scalability, not raw performance: a hash can be slower than a plain loop for small data sets. And even though hashing at O(1) is better on paper, binary search at O(log2 n) is pretty darn good: you can search a billion records in about 30 steps, since 2^30 is roughly a billion. In addition, it is cache- and branch-predictor-friendly.
I have a set of hierarchical data being used in a SQL Server database. The data is stored with a guid as the primary key and a parentGuid as a foreign key pointing to the object's immediate parent. I access the data most often through Entity Framework in a WebApi project. To make the situation a little more complex, I also need to manage permissions based on this hierarchy, such that a permission applied to a parent applies to all of its descendants. My question is this:
I have searched all over and cannot decide which would be best to handle this situation. I know I have the following options.
I can create recursive CTEs (Common Table Expressions, aka RCTEs) to handle the hierarchical data. This seems to be the simplest approach for normal access, but I'm worried it may be slow when used to determine permission levels for child objects.
I can create a hierarchyId data type field in the table and use SQL Server provided functions such as GetAncestor(), IsDescendantOf(), etc. This seems like it would make querying fairly easy, but seems to require a fairly complex insert/update trigger to keep the hierarchyId field correct through inserts and moves.
I can create a closure table, which would store all of the relationships in the table. I imagine it as such: a parent column and a child column, with each parent -> child relationship represented (i.e., 1->2, 2->3 would be represented in the database as 1-2, 1-3, 2-3). The downsides are that this requires insert, update, and delete triggers, even though they are fairly simple, and that this method generates a lot of records.
I have tried searching all over and can't find anything giving any advice between these three methods.
PS I am also open to any alternative solutions to this problem
I have used all three methods. It's mostly a question of taste.
I agree that hierarchy with parent-child relationships in the table is the simplest. Moving a subtree is simple and it's easy to code the recursive access with CTEs. Performance is only going to be an issue if you have very large tree structures and you are frequently accessing the hierarchical data. For the most part, recursive CTEs are very fast when you have the correct indexes on the table.
The closure table is more like a supplement to the above. Finding all the descendants of a given node is lightning fast; you don't need the CTEs, just one extra join, so it's sweet. Yes, the number of records blows up, but I think it is no more than N-1 times the number of nodes for a tree of depth N (e.g., a ternary tree of depth 5 has 1 + 3 + 9 + 27 + 81 = 121 nodes, so storing only parent-child relationships takes one row per node, vs. 1 + 3 + (9 * 2) + (27 * 3) + (81 * 4) = 427 rows for the closure table). In addition, the closure table records are so narrow (just 2 ints at a minimum) that they take up almost no space. Generating the list of records to insert into the closure table when a new record is inserted into the hierarchy takes a tiny bit of overhead.
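A minimal sketch of that one-extra-join lookup (the NodeClosure and Hierarchy names and the int keys are my assumptions; your tables use guids):

CREATE TABLE NodeClosure (
    AncestorId   INT NOT NULL,
    DescendantId INT NOT NULL,
    Depth        INT NOT NULL,  -- 0 = self, 1 = direct child, and so on
    PRIMARY KEY (AncestorId, DescendantId)
);

-- All descendants of a node: one join against the hierarchy table, no CTE.
DECLARE @NodeId INT = 1;
SELECT h.*
FROM Hierarchy AS h
JOIN NodeClosure AS c ON c.DescendantId = h.Id
WHERE c.AncestorId = @NodeId
  AND c.Depth > 0;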
I personally like hierarchyId since it really combines the benefits of the above two: compact storage and lightning fast access. Once you get it set up, it is easy to query and takes very little space. As you mentioned, it's a little tricky to move subtrees around, but it's manageable. Anyway, how often do you really move a subtree in a hierarchy? There are some links you can find that suggest some methods, e.g.:
http://sqlblogcasts.com/blogs/simons/archive/2008/03/31/SQL-Server-2008---HierarchyId---How-do-you-move-nodes-subtrees-around.aspx
The main drawback I have found to hierarchyId is the learning curve. It's not as obvious how to work with it as the other two methods. I have worked with some very bright SQL developers who would frequently get snagged on it, so you end up with one or two resident experts who have to field questions from everyone else.
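And a short sketch of the hierarchyId flavor, using the built-in GetLevel() and IsDescendantOf() methods (the OrgChart table is a hypothetical example):

CREATE TABLE OrgChart (
    Node hierarchyid PRIMARY KEY,
    Name NVARCHAR(100) NOT NULL
);

DECLARE @Mgr hierarchyid = (SELECT Node FROM OrgChart WHERE Name = 'Alice');

-- Every descendant of @Mgr; note IsDescendantOf() also matches the node itself.
SELECT Name, Node.GetLevel() AS Lvl
FROM OrgChart
WHERE Node.IsDescendantOf(@Mgr) = 1
  AND Node <> @Mgr;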
I've been trying to learn programming for a while. I've studied Java and Python, and I'm comfortable with their syntax. Recently, I wanted to use what I've learnt by coding a tangible piece of software from the ground up.
I want to implement a database engine, sort of a NoSQL database. I've put together a small document, sort of a specification to follow throughout my adventure of coding it. But all I know is a bunch of keywords. I don't know where to start.
Can someone help me find out how to gather the knowledge I need for this kind of work, and in what order to learn things? I have searched for documents, but I feel like I'll end up finding unrelated/erroneous content or starting from the wrong point, because implementing a complete database engine is (seemingly) a truly complicated task.
I want to express that I'd prefer theses, whitepapers, and (e)books to other projects' code, because when I've asked questions of this kind before, the usual answer is "read project X's source code". I'm not at the level of comfortably reading and understanding source code.
First, you may have a look at the answers to How to write a simple database engine. While it focuses on a SQL engine, there is still a lot of good material in the answers.
Otherwise, a good project tutorial is Implementation of a B-Tree Database Class. The example code is in C++, but the description of what is done and why is probably what you'll want to look at anyway.
Also, there is Designing and Implementing Structured Storage (Database Engine) over at MSDN. Plenty of information there to help you in your learning project.
Because the accepted answer only offers (good) links to other resources, I thought I'd share my experience writing webdb, a small experimental database for browsers. I also invite you to read the source code. It's pretty small; you should be able to read through it and get a basic understanding of what it's doing in a couple of hours. Warning: I am a n00b at this, and since writing it I have learned a lot more and can see that I did some things wrong. It can help you get started, though.
The basics: BTree
I started out by adapting an AVL tree to suit my needs. An AVL tree is a kind of self-balancing binary search tree. You store the key K and related data (if any) in a node; all items with key < K go in the node's left subtree and all items with key > K in its right subtree. You can use an array to store the data items if you want to support non-unique keys.
This tree will give you the basics: Create, Update, Delete and a way to quickly get an item by key, or all items with key < x, or with key between x and y etc. It can serve as the index for our table.
A schema
As a next step I wrote code that lets the client code define a schema, with methods like createTable() etc. Schemas are typically associated with SQL, but even NoSQL sort of has a schema: stores usually require you to mark the ID field and any other fields you want to search on. You can make your schema as fancy as you want, but you typically want to model at least which column(s) serve as the primary key and which fields will be searched on frequently and need an index.
Creating a data structure to store a table
I decided to use the tree I created in the first step to store my items. These were simple JS objects. Having defined which field contains the PK, I could simply insert the item into the tree using that field's value as the key. This gives me quick lookup by ID (range).
Next I added another tree for every column that needs an index. In these trees I did not store the full record, but only the key. So to fetch a customer by last name, I would first use the index on last name to get the ID, then the primary key index to get the actual record. The reason I did not just store a reference to the actual object is that it makes set operations a little simpler (see the next step).
Querying
Now that we have a table with indexes for PK and search fields, we can implement querying. I did not take this very far as it becomes complicated quickly, but you can get some nice functionality with just some basics. WebDB does not implement joins; all queries operate only on a single table. But once you understand this you see a pretty clear (though long and winding) path to doing joins and other complicated stuff as well.
In WebDB, to get all customers with firstName = 'John' and city = 'New York' (assuming those are two search fields), you would write something like:
var webDb = ...
var johnsFromNY = webDb.customers.get({
  firstName: 'John',
  city: 'New York'
});
To solve it, we first do two lookups: we get the set X of all IDs of customers named 'John' and we get the set Y of all IDs of customers from New York. We then perform an intersection on these two sets to get all IDs of customers that are both named 'John' AND from New York. We then run through our set of resulting IDs, getting the actual record for each one and adding it to the result array.
Using the set operators like union and intersection we can perform AND and OR searches. I only implemented AND.
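As a side note, this manual set intersection is exactly what SQL expresses with the INTERSECT operator; a rough analogy in T-SQL over a hypothetical Customers table:

SELECT Id FROM Customers WHERE FirstName = 'John'
INTERSECT
SELECT Id FROM Customers WHERE City = 'New York';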
Doing joins would (I think) involve creating temporary tables in memory, populating them with the joined results as the query runs, and then applying the query criteria to the temp table. I never got there. I attempted some syncing logic next, but that was too ambitious and it went downhill from there :)
I'm starting a project and I'm in the design phase: i.e., I haven't yet decided which DB framework I'm going to use. I'm going to have code that creates a "forest"-like structure. That is, many trees, where each tree is standard: nodes and edges. After the code creates these trees I want to save them in the DB (and eventually pull them back out).
The naive approach to representing the data in the DB is a relational DB with two tables: nodes and edges. That is, the nodes table will have a node ID, node data, etc., and the edges table will be a mapping of node ID to node ID.
Is there a better approach, or is this the best one given my (limited) assumptions? And what if we add the assumption that the trees are relatively small: would it be better to save the whole tree as a blob in the DB? Which type of DB should I use in that case? Please comment on speed/scalability.
Thanks
I showed a solution similar to your nodes & edges tables in my answer to the StackOverflow question "What is the most efficient/elegant way to parse a flat table into a tree?". I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Be sure to use some sort of low-level code for the entity being treed, to prevent looping. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file you can support, say, two relationships between the two files: a parent relation and a child relation.
A level is the level at which an entity is found in a tree. A low-level code for the entity is the lowest level at which it is found in any tree anywhere. To prevent a loop, check that the low-level code of the entity you want to make a child is less than or equal to that of the intended parent; after being added as a child, the entity will sit at least one level lower.
I have a product catalog. Each category consists of a different number (depth) of subcategories. The number of levels (the depth) is unknown, but I am quite sure it will not exceed 5 or 6 levels. The data changes much more rarely than it is read.
The question is: what type of hierarchical data model is more suitable for such a situation? The project is based on the Django framework, and its peculiarities (admin interface, model handling, ...) should be considered.
Many thanks!
Nested sets are better for performance if you don't need frequent updates or hierarchical ordering.
If you need either tree updates or hierarchical ordering, it's better to use the parent-child data model.
It's easily constructed in Oracle and SQL Server 2005+, and not so easily (but still possible) in MySQL.
I would use the Modified Preorder Tree Traversal algorithm, MPTT, for this sort of hierarchical data. This allows great performance on traversing the tree and finding children, if you don't mind a bit of a penalty on changes to the structure.
Luckily Django has a great library available for this, django-mptt. I've used this in a number of projects with a lot of success. There's also django-treebeard which offers several alternative algorithms, but I haven't used it (and it doesn't seem as popular as mptt anyway).
According to these articles:
http://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/
http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/
"MySQL is the only system of the big four (MySQL, Oracle, SQL Server, PostgreSQL) for which the nested sets model shows decent performance and can be considered to stored hierarchical data."
http://www.sqlsummit.com/AdjacencyList.htm
The Adjacency List is much easier to maintain and Nested Sets are a lot faster to query.
The problem has always been that converting an Adjacency List to Nested Sets has taken way too long, thanks to a really nasty "push stack" method that's loaded with RBAR (row-by-agonizing-row processing). So people end up doing some really difficult maintenance in Nested Sets or not using them.
Now, you can have your cake and eat it, too! You can do the conversion on 100,000 nodes in less than 4 seconds and on a million rows in less than a minute! All in T-SQL, by the way! Please see the following articles.
Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets
Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations