Hierarchical Data Models: Adjacency List vs. Nested Sets - database

I have a product catalog. Each category contains subcategories nested to a varying depth. The maximum depth is not known in advance, but I am fairly sure it will not exceed 5 or 6 levels. Data changes are much rarer than reads.
The question is: which hierarchical data model is more suitable for this situation? The project is based on the Django framework, and its peculiarities (admin interface, model handling...) should be considered.
Many thanks!

Nested sets are better for performance, if you don't need frequent updates or hierarchical ordering.
If you need either tree updates or hierarchical ordering, it's better to use parent-child data model.
It's easily constructed in Oracle and SQL Server 2005+, and not so easily (but still possible) in MySQL.
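To make the parent-child option concrete, here is a minimal sketch of fetching a whole subtree with a recursive CTE. It uses Python's built-in sqlite3 module only so that it runs anywhere; the `category` table, its columns, and the sample rows are assumptions made up for the example, and other databases have their own syntax for hierarchical queries.

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES category(id),  -- NULL for root categories
        name TEXT NOT NULL
    );
    INSERT INTO category (id, parent_id, name) VALUES
        (1, NULL, 'Electronics'),
        (2, 1, 'Computers'),
        (3, 2, 'Laptops'),
        (4, 1, 'Phones'),
        (5, NULL, 'Books');
""")


def descendants(conn, root_id):
    """Return (id, name, depth) for every category below root_id."""
    query = """
        WITH RECURSIVE subtree(id, name, depth) AS (
            SELECT id, name, 0 FROM category WHERE id = ?
            UNION ALL
            SELECT c.id, c.name, s.depth + 1
            FROM category AS c
            JOIN subtree AS s ON c.parent_id = s.id
        )
        SELECT id, name, depth FROM subtree WHERE depth > 0 ORDER BY id
    """
    return conn.execute(query, (root_id,)).fetchall()


electronics_subtree = descendants(conn, 1)
```

Moving a category in this model is a single-row update of `parent_id`, which is why it suits trees that change often.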

I would use the Modified Preorder Tree Traversal algorithm, MPTT, for this sort of hierarchical data. This allows great performance on traversing the tree and finding children, if you don't mind a bit of a penalty on changes to the structure.
Luckily Django has a great library available for this, django-mptt. I've used this in a number of projects with a lot of success. There's also django-treebeard which offers several alternative algorithms, but I haven't used it (and it doesn't seem as popular as mptt anyway).

According to these articles:
http://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/
http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/
"MySQL is the only system of the big four (MySQL, Oracle, SQL Server, PostgreSQL) for which the nested sets model shows decent performance and can be considered to stored hierarchical data."

http://www.sqlsummit.com/AdjacencyList.htm

The Adjacency List is much easier to maintain and Nested Sets are a lot faster to query.
The problem has always been that converting an Adjacency List to Nested Sets has taken way too long thanks to a really nasty "push stack" method that's loaded with RBAR. So people end up doing some really difficult maintenance in Nested Sets or not using them.
Now, you can have your cake and eat it, too! You can do the conversion on 100,000 nodes in less than 4 seconds and on a million rows in less than a minute! All in T-SQL, by the way! Please see the following articles.
Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets
Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations
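
For readers who have not seen the two models side by side, here is a rough Python sketch of what the conversion produces. It is a plain depth-first numbering in application code, not the set-based T-SQL technique from the articles above; the function names and the sample tree are made up for the example.

```python
from collections import defaultdict


def to_nested_sets(parent_of):
    """Map each node to (lft, rgt), given a {node: parent} dict (None = root)."""
    children = defaultdict(list)
    roots = []
    for node, parent in parent_of.items():
        if parent is None:
            roots.append(node)
        else:
            children[parent].append(node)

    lft, rgt = {}, {}
    counter = 1
    # Iterative depth-first walk; the flag marks the second visit to a node,
    # which happens after all of its children have been numbered.
    stack = [(root, False) for root in reversed(roots)]
    while stack:
        node, children_done = stack.pop()
        if children_done:
            rgt[node] = counter
        else:
            lft[node] = counter
            stack.append((node, True))
            stack.extend((child, False) for child in reversed(children[node]))
        counter += 1
    return {node: (lft[node], rgt[node]) for node in lft}


def descendants(bounds, root):
    """Nested-set read: a descendant's interval lies strictly inside the root's."""
    root_lft, root_rgt = bounds[root]
    return [n for n, (l, r) in bounds.items() if root_lft < l and r < root_rgt]


# Hypothetical adjacency list: child -> parent.
parent_of = {
    "Electronics": None,
    "Computers": "Electronics",
    "Laptops": "Computers",
    "Phones": "Electronics",
}
bounds = to_nested_sets(parent_of)
```

The adjacency list stays the source of truth for edits, and the `(lft, rgt)` pairs are recomputed from it whenever fast subtree reads are needed.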

Related

sql | slow queries | avoid many joins

I am currently working with java spring and postgres.
I have a query on a table, many filters can be applied to the query and each filter needs many joins.
This query is very slow, due to the number of joins that must be performed, also because there are many elements in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres for storing a flattened list of items?
I know there is a JSON datatype. Could I use it to hold the information needed for the search and query it with jsonPath? Does this perform well with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, such as a graph database. At this point the only problem I have is with this specific table; the rest of the application consists of simple queries that fit relational databases very well.
Is there any rule of thumb, based on ratios and number of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searching processes for relational databases. It uses a standard time vs space tradeoff to reduce the time of query, at the cost of duplicating the data and increasing write/insert time.
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.

Django - Optional recursive relationship

I am trying to use Django to create a recursive relationship, which gives users a folder-like hierarchical structure in which to place resources.
What would be the best way to achieve this?
I know I could use treebeard or mptt to create a nested set but I have read that making changes to the tree structure (something that would be happening a lot in this case) can be quite an intensive operation as a lot of fields have to be updated.
On the other hand, I could define a folder model with a ForeignKey to self, but how do I manage the top-level folders, which have no foreign key value? Will Django complain if I just set this value to NULL?
Any advice appreciated.
Thanks.
Treebeard actually supports three different tree implementations; just choose the one that suits your needs:
Adjacency List (fast writes at the cost of slow reads)
Materialized Path (probably the fastest way of working with trees in SQL)
Nested Sets (very efficient reads at the cost of high maintenance on write/delete operations)
Docs are here: https://tabo.pe/projects/django-treebeard/docs/tip/
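
As a rough illustration of the materialized path idea from the list above, here is a small pure-Python sketch. The fixed-width, zero-padded step encoding and all function names are assumptions for the example, not treebeard's actual implementation.

```python
STEP_WIDTH = 4  # assumed width of one path step, e.g. "0001"


def child_path(parent_path, position, width=STEP_WIDTH):
    """Append a fixed-width, zero-padded step to the parent's path."""
    return f"{parent_path}{position:0{width}d}"


def is_descendant(path, ancestor_path):
    """With fixed-width steps, descendants are exactly the longer paths sharing the prefix."""
    return path != ancestor_path and path.startswith(ancestor_path)


def depth(path, width=STEP_WIDTH):
    """Root nodes have depth 1."""
    return len(path) // width


# Hypothetical folder tree.
documents = child_path("", 1)          # "0001"
photos = child_path("", 2)             # "0002"
reports = child_path(documents, 1)     # "00010001"
drafts = child_path(reports, 3)        # "000100010003"
```

In SQL the same descendant test becomes a prefix match on the path column, and sorting by path yields the tree in depth-first order.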

How to represent a tree like structure in a db

I'm starting a project and I'm in the designing phase: I.e., I haven't decided yet on which db framework I'm going to use. I'm going to have code that creates a "forest" like structure. That is, many trees, where each tree is a standard: nodes and edges. After the code creates these trees I want to save them in the db. (and then pull them out eventually)
The naive approach to representing the data in the db is a relational db with two tables: nodes and edges. That is, the nodes table will have a node id, node data, etc.. And the edges table will be a mapping of node id to node id.
Is there a better approach? Or given the (limited) assumptions I'm giving this is the best approach? How about if we add an assumption that the trees are relatively small - is it better to save the whole tree as a blob in the db? Which type of db should I use in that case? Please comment on speed/scalability.
Thanks
I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
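For anyone who wants to try the idea quickly, here is a minimal Closure Table sketch using Python's built-in sqlite3 module. The table and column names (`node`, `closure`, `depth`) and the helper functions are assumptions for this example rather than the exact schema from the presentation or the book.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INTEGER PRIMARY KEY, data TEXT);
    CREATE TABLE closure (
        ancestor   INTEGER NOT NULL REFERENCES node(id),
        descendant INTEGER NOT NULL REFERENCES node(id),
        depth      INTEGER NOT NULL,
        PRIMARY KEY (ancestor, descendant)
    );
""")


def add_node(conn, node_id, data, parent_id=None):
    """Insert a node and one closure row per ancestor (including itself)."""
    conn.execute("INSERT INTO node (id, data) VALUES (?, ?)", (node_id, data))
    # Every node is its own ancestor at depth 0.
    conn.execute(
        "INSERT INTO closure (ancestor, descendant, depth) VALUES (?, ?, 0)",
        (node_id, node_id),
    )
    if parent_id is not None:
        # Copy every ancestor row of the parent, one level deeper.
        conn.execute(
            """INSERT INTO closure (ancestor, descendant, depth)
               SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?""",
            (node_id, parent_id),
        )


def descendants(conn, node_id):
    """Return the data of all nodes below node_id, nearest levels first."""
    rows = conn.execute(
        """SELECT n.data FROM closure AS c
           JOIN node AS n ON n.id = c.descendant
           WHERE c.ancestor = ? AND c.depth > 0
           ORDER BY c.depth, n.id""",
        (node_id,),
    )
    return [data for (data,) in rows]


# Hypothetical tree: root -> (a -> a1), b
add_node(conn, 1, "root")
add_node(conn, 2, "a", parent_id=1)
add_node(conn, 3, "b", parent_id=1)
add_node(conn, 4, "a1", parent_id=2)
```

Subtree and ancestor lookups are then single joins against `closure`, with no recursion, at the cost of storing one row per ancestor-descendant pair.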
Be sure to use some sort of low-level coding for the entity being treed to prevent looping. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file, you can loop through either of, say, two relationships between the two files: a parent relation and a child relation.
A level is the level at which an entity is found in a tree. The low-level code for an entity is the lowest level at which that entity is found in any tree anywhere. Check to make sure the low level code of the entity you want to make a child is less than or equal to prevent a loop. After adding an entity as a child, it will be at least one level lower.

NoSQL DB and Reporting

I am in the architecture stage of an academic project involving billions of records. The project should be very lightweight in terms of computing power and highly scalable.
The information structure is very simple: I need to store a list of items, each one with different features. The features are integers, decimals, dates, strings, etc. When the data is imported, the types of the features are known. Also, features can be used to reference other items.
I need to be able to get and sort a list of items by its features (more than one) - possibly using queries such as >, <, =, and regexes, length, left, right, mid for strings between the feature values and against user arbitrary input.
Reporting in the sense of sums, averages, and grouping is also necessary, but the demands for that are more relaxed: there is no need for full cube capabilities, but more is better.
I am very new to the whole NoSQL world. What would you recommend?
If you check out the tutorials for MongoDB, they have, in my opinion, the best introduction to the Map/Reduce system that is used to query and aggregate.
I do wonder, though, why you have concluded in advance that NoSQL is the route to go. Although different items may have different schemas, is there a fixed number of entities and attributes? And why have you (if you have) ruled out SQL, which, after all, has decades of accumulated features for storing and querying data?
If you are going to use aggregates, then you could use map reduce to populate aggregate tables and then serve that data.
Writing map reduce for every query may be cumbersome, so you can also have a look at Apache Pig and Hive. This is especially helpful for the kind of ad hoc queries you are talking about.

Are nested intervals a viable solution to nested set (modified pre-order traversal) RDBMS performance degradation?

Among the known limitations of Joe Celko's nested sets (modified pre-order traversal) is marked degradation in performance as the tree grows to a large size.
Vadim Tropashko proposed nested intervals, and provides examples and theory explanation in this paper: http://arxiv.org/html/cs.DB/0401014
Is this a viable solution, are there any viable examples (in any language) abstracted away from the native DB layer?
While I've seen examples for nested sets, I haven't seen much for nested intervals, although in theory it shouldn't be difficult to convert from one to the other. Instead of doing pre-order traversal to label the nodes, do a breadth-first recursion. The trick is to work out the most efficient way of labelling n children of a node. Since the node between a/b and c/d is (a+c)/(b+d), an ill-conditioned insert (for instance, inserting the children left to right), runs the risk of creating the same exponential growth in the index values as, for instance, using a full materialized path. It is not difficult to counteract this effect - create the new indexes one at a time, inserting each at the location that produces the lowest resulting denominator.
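To illustrate the labelling arithmetic above, here is a small Python sketch. It represents each label as a (numerator, denominator) pair and shows one reading of the "lowest resulting denominator" suggestion; the function names and the choice of 0/1 and 1/1 as the parent's bounds are assumptions for the example.

```python
def mediant(left, right):
    """The label between a/b and c/d is (a + c)/(b + d)."""
    (a, b), (c, d) = left, right
    return (a + c, b + d)


def insert_left_to_right(lower, upper, count):
    """Always insert between the most recent label and the upper bound."""
    labels = []
    for _ in range(count):
        lower = mediant(lower, upper)
        labels.append(lower)
    return labels


def insert_lowest_denominator(points):
    """Insert one label into the gap whose mediant has the smallest denominator.

    `points` is the ordered list of existing labels including both bounds;
    it is modified in place and the new label is returned.
    """
    gaps = zip(points, points[1:])
    left, right = min(gaps, key=lambda gap: gap[0][1] + gap[1][1])
    new_label = mediant(left, right)
    points.insert(points.index(right), new_label)
    return new_label


# Three children under a parent spanning 0/1 .. 1/1 (hypothetical bounds).
sequential = insert_left_to_right((0, 1), (1, 1), 3)

points = [(0, 1), (1, 1)]
balanced = [insert_lowest_denominator(points) for _ in range(3)]
```

Note that the second strategy picks the position for each new label, so the children do not end up ordered by insertion time.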
As far as performance degradation goes, much depends on the operations you intend to do. There are still some operations that will require a complete relabeling of the entire tree: the nested set and nested interval methods both work best for structures that seldom change. If you are doing a lot of structure changes to the hierarchy, the 'standard' parent-child table structure may be easier to work with. Remember too that some operations (such as counting descendants) are far easier with the integer labeling of nested sets than with the interval methods.
I have written a gem that abstracts away all the computations of nested intervals to be used with Rails's ActiveRecord https://github.com/clyfe/acts_as_nested_interval/ used in production on several systems.

Resources