Django - Optional recursive relationship - database

I am trying to use Django to create a recursive relationship, which gives users a folder-like hierarchical structure in which to place resources.
What would be the best way to achieve this?
I know I could use treebeard or mptt to create a nested set, but I have read that changing the tree structure (something that would happen a lot in this case) can be quite an intensive operation, since many fields have to be updated.
On the other hand, I could create a folder model with a ForeignKey to self, but how do I manage the top-level folders that have no foreign key value? Will Django complain if I just set this value to NULL?
Any advice appreciated.
Thanks.

Treebeard actually supports three different tree implementations, so just choose the one that suits your needs:
Adjacency List (fast writes at the cost of slow reads)
Materialized Path (probably the fastest way of working with trees in SQL)
Nested Sets (very efficient reads at the cost of high maintenance on write/delete operations)
Docs are here: https://tabo.pe/projects/django-treebeard/docs/tip/
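As for the plain self-referential ForeignKey mentioned in the question: Django will not complain, as long as the field is declared with null=True (plus blank=True if it should also be optional in forms); top-level folders then simply store NULL. A minimal sketch of the adjacency-list table that sits underneath such a model (names are illustrative, not taken from treebeard or mptt):

-- Adjacency list: a nullable self-referencing FK; root folders store NULL.
CREATE TABLE folder (
    folder_id  SERIAL PRIMARY KEY,
    name       TEXT NOT NULL,
    parent_id  INTEGER NULL REFERENCES folder (folder_id)
);

-- Top-level folders are simply the rows with no parent.
SELECT folder_id, name FROM folder WHERE parent_id IS NULL;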

sql | slow queries | avoid many joins

I am currently working with Java Spring and Postgres.
I have a query on a table; many filters can be applied to it, and each filter needs several joins.
The query is very slow, both because of the number of joins that must be performed and because there are many rows in the table.
Foreign keys and indexes are correctly created.
I know one approach could be to keep duplicate information to avoid doing the joins. By this I mean creating a new table called infoSearch and keeping it updated via triggers. At the time of the query, perform search operations on said table. This way I would do just one join.
But I have some doubts:
What is the best approach in Postgres to store an item list in flat form?
I know there is a json datatype; could I use it to hold the information needed for the search and query it with jsonpath? Is this performant with lists?
I also greatly appreciate any advice on another approach that can be used to fix this.
Is there any software that can be used to make this more efficient?
I'm wondering if it wouldn't be more performant to move to another style of database, such as a graph-based one. At this point the only problem I have is with this specific table; the rest of the workload is simple queries that adapt very well to relational databases.
Is there any scaling guideline, based on ratios and numbers of items, for choosing which kind of database to use?
Denormalization is a tried and true way to speed up queries/reports/searches in relational databases. It is a standard space-for-time tradeoff: it reduces query time at the cost of duplicating data and increasing write/insert time.
There are third party tools that are specifically designed for this use-case, including search tools (like ElasticSearch, Solr, etc) and other document-centric databases. Graph databases are probably not useful in this context. They are focused on traversing relationships, not broad searches.
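To make the trigger-maintained flat table from the question concrete, here is a hedged sketch in PostgreSQL (11+ syntax; every table and column name is invented for illustration, and the source table is assumed to be called item):

-- Flattened search table: one row per item, so searches need one join at most.
CREATE TABLE info_search (
    item_id   BIGINT PRIMARY KEY REFERENCES item (id),
    category  TEXT,
    supplier  TEXT,
    tags      JSONB  -- list-valued attributes kept as a jsonb array
);

CREATE INDEX idx_info_search_category ON info_search (category);
CREATE INDEX idx_info_search_tags ON info_search USING GIN (tags);

-- Keep it in sync with the source table (simplified to inserts/updates).
CREATE FUNCTION sync_info_search() RETURNS trigger AS $$
BEGIN
    INSERT INTO info_search (item_id, category, supplier)
    VALUES (NEW.id, NEW.category, NEW.supplier)
    ON CONFLICT (item_id) DO UPDATE
        SET category = EXCLUDED.category,
            supplier = EXCLUDED.supplier;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_item_info_search
AFTER INSERT OR UPDATE ON item
FOR EACH ROW EXECUTE FUNCTION sync_info_search();

On the json question: a jsonb column with a GIN index is queryable with the @> containment operator, and PostgreSQL 12+ adds SQL/JSON path functions (jsonb_path_exists and friends), so keeping list-valued attributes in one flat row this way is workable; whether it is fast enough for your lists is something only a benchmark on your data will tell.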

Modeling for Graph DBs

Coming from a SQL/NoSQL background, I am finding it quite challenging to model (efficiently, that is) even the simplest of exercises in a graph DB.
While different technologies have limitations and best practices, I am uncertain whether the mindset I am using while creating the models is the correct one; hence, I am in need of guidance, advice, and/or resources to help me get closer to the right practices.
The initial exercise I have tried is representing a file share's entire directory (subfolders and files) in a graph DB. For instance, some of the attributes and queries I would like to include are:
The hierarchical structure of the folders
The aggregate size at the current node
Being able to search based on who created a file/folder
Being able to search on file types
This brings me to the following questions:
When/which attributes should be used for edges? Only those on which I intend to search? Only relationships?
What if I later wish to extend my graph's capabilities, for instance to search for files bigger than X? How does one maximize the future capabilities/flexibility of the model so that such changes do not have a massive impact?
Currently I am exploring InfiniteGraph and TitanDB.
1) The only attribute I can think of to describe an edge in a folder hierarchy is whether it is a contains or contained-by relationship.
(You don't even need that if you decide to consider all your edges one or the other. In your case, it looks like you'll almost always be interrogating descendants to search and to return aggregate size).
This is a lot simpler than a network, or a hierarchy where the edges may be of different types. Think of an organization chart that tracks not only who manages whom, but who supports whom, mentors whom, harasses whom, whatever.
2) I'm not familiar with the two databases you mentioned, but Neo4J allows indexes on node properties, so adding an index on file_size should not have much impact. It's also "schema-less," so that you can add attributes on the fly and various nodes may contain different attributes.

Is this a "correct" database design?

I'm working with the new version of a third party application. In this version, the database structure is changed, they say "to improve performance".
The old version of the DB had a general structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES
(
ENTITY_ID,
PROPERTY_KEY,
PROPERTY_VALUE
)
so we had a main table with fields for the basic properties and a separate table to manage custom properties added by users.
The new version of the DB instead has a structure like this:
TABLE ENTITY
(
ENTITY_ID,
STANDARD_PROPERTY_1,
STANDARD_PROPERTY_2,
STANDARD_PROPERTY_3,
...
)
TABLE ENTITY_PROPERTIES_n
(
ENTITY_ID_n,
CUSTOM_PROPERTY_1,
CUSTOM_PROPERTY_2,
CUSTOM_PROPERTY_3,
...
)
So now, when the user adds a custom property, a new column is added to the current ENTITY_PROPERTIES table until the maximum number of columns (managed by the application) is reached; then a new table is created.
So, my question is: is this a correct way to design a DB structure? Is this the only way to "improve performance"? The old structure required many joins or sub-selects, but this structure doesn't seem very smart (or even correct) to me...
I have seen this done before, based on the assumed (often unproven) "expense" of joining - it is basically turning a row-heavy data table into a column-heavy table. They ran into their own limitation, as you imply, by creating new tables when they run out of columns.
I completely disagree with it.
Personally, I would stick with the old structure and re-evaluate the performance issues. That isn't to say the old way is the correct way; it is just marginally better than the "improvement" in my opinion, and it removes the need for large-scale re-engineering of database tables and DAL code.
These tables strike me as largely static... caching would be an even better performance improvement without mutilating the database and one I would look at doing first. Do the "expensive" fetch once and stick it in memory somewhere, then forget about your troubles (note, I am making light of the need to manage the Cache, but static data is one of the easiest to manage).
Or, wait for the day you run into the maximum number of tables per database :-)
Others have suggested completely different stores. This is a perfectly viable possibility and if I didn't have an existing database structure I would be considering it too. That said, I see no reason why this structure can't fit into an RDBMS. I have seen it done on almost all large scale apps I have worked on. Interestingly enough, they all went down a similar route and all were mostly "successful" implementations.
No, it's not. It's terrible.
until the max number of column (handled by application) is reached,
then a new table is created.
This sentence says it all. Under no circumstances should an application dynamically create tables. The "old" approach isn't ideal either, but since you have the requirement to let users add custom properties, it has to be something like this.
Consider this:
You lose all type-safety, as you have to store all values in the column PROPERTY_VALUE.
Depending on your users, you could have them change the schema beforehand and then run some kind of database update batch job, so that at least all the properties would be declared with the right datatype. That way you could also drop the generic key/value indirection.
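A hedged sketch of that batch-job idea in PostgreSQL-flavored SQL (the property name delivery_date is invented for illustration): promote a known custom property to a real, typed column and backfill it from the key/value table.

-- 1. Declare the property as a typed column on the main table.
ALTER TABLE entity ADD COLUMN delivery_date DATE;

-- 2. Backfill it from the generic key/value rows.
UPDATE entity e
SET    delivery_date = CAST(p.property_value AS DATE)
FROM   entity_properties p
WHERE  p.entity_id = e.entity_id
  AND  p.property_key = 'delivery_date';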
Check out this: http://en.wikipedia.org/wiki/Inner-platform_effect. This design certainly reeks of it.
Maybe an RDBMS isn't the right thing for your app. Consider using a key/value-based store like MongoDB or another NoSQL database (http://nosql-database.org/).
From what I know of databases (but I'm certainly not the most experienced), it seems quite a bad idea to do that in your database. If you already know the maximum number of custom properties a user might have, I'd say you'd better fix the table's number of columns at that value.
Then again, I'm not an expert, but adding new columns on the fly isn't the kind of operation databases like. It's going to bring you more trouble than anything.
If I were you, I'd either fix the number of custom properties or stick with the old system.
I believe creating a new table for each entity to store properties is a bad design, as you could end up bloating the database with tables. The only pro of the second method would be that you are not traversing all of the redundant rows that do not apply to the selected entity. However, using indexes on the original ENTITY_PROPERTIES table could help greatly with performance.
I would personally stick with your initial design, apply indexes and let the database engine determine the best methods for selecting the data rather than separating each entity property into a new table.
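For instance, a minimal sketch of such indexes (using the question's schema; which ones pay off depends on your actual queries):

-- Fast "all properties of entity X" and "property K of entity X" lookups.
CREATE INDEX idx_props_by_entity ON ENTITY_PROPERTIES (ENTITY_ID, PROPERTY_KEY);

-- Fast "which entities have property K = V" searches.
CREATE INDEX idx_props_by_value ON ENTITY_PROPERTIES (PROPERTY_KEY, PROPERTY_VALUE);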
There is no "correct" way to design a database - I'm not aware of a universally recognized set of standards other than the famous "normal form" theory; many database designs ignore this standard for performance reasons.
There are ways of evaluating database designs though - performance, maintainability, intelligibility, etc. Quite often, you have to trade these against each other; that's what your change seems to be doing - trading maintainability and intelligibility against performance.
So, the best way to find out if that was a good trade off is to see if the performance gains have materialized. The best way to find that out is to create the proposed schema, load it with a representative dataset, and write queries you will need to run in production.
I'm guessing that the new design will not be perceivably faster for queries like "find STANDARD_PROPERTY_1 from ENTITY where STANDARD_PROPERTY_1 = 'banana'".
I'm guessing it will not be perceivably faster when retrieving all properties for a given entity; in fact it might be slightly slower, because instead of a single join to ENTITY_PROPERTIES, the new design requires joins to several tables. You will be returning "sparse" results - presumably, not all entities will have values in the property_n columns in all ENTITY_PROPERTIES_n tables.
Where the new design may be significantly faster is when you need a compound WHERE clause on custom properties. For instance, finding an entity where custom property 1 is true, custom property 2 is banana, and custom property 3 is not in ('kylie', 'pussycat dolls', 'giraffe') is (probably) faster when you can specify columns in the ENTITY_PROPERTIES_n tables instead of rows in the ENTITY_PROPERTIES table. Probably.
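To make that concrete, a sketch of the two query shapes (identifiers follow the question's naming; the EAV version assumes each tested property exists as a row):

-- New design: custom properties are columns, so the predicate is flat.
SELECT e.ENTITY_ID
FROM   ENTITY e
JOIN   ENTITY_PROPERTIES_1 p ON p.ENTITY_ID_1 = e.ENTITY_ID
WHERE  p.CUSTOM_PROPERTY_1 = 'true'
  AND  p.CUSTOM_PROPERTY_2 = 'banana'
  AND  p.CUSTOM_PROPERTY_3 NOT IN ('kylie', 'pussycat dolls', 'giraffe');

-- Old EAV design: one join back to ENTITY_PROPERTIES per property tested.
SELECT e.ENTITY_ID
FROM   ENTITY e
JOIN   ENTITY_PROPERTIES p1 ON p1.ENTITY_ID = e.ENTITY_ID
                           AND p1.PROPERTY_KEY = 'custom_property_1'
                           AND p1.PROPERTY_VALUE = 'true'
JOIN   ENTITY_PROPERTIES p2 ON p2.ENTITY_ID = e.ENTITY_ID
                           AND p2.PROPERTY_KEY = 'custom_property_2'
                           AND p2.PROPERTY_VALUE = 'banana'
JOIN   ENTITY_PROPERTIES p3 ON p3.ENTITY_ID = e.ENTITY_ID
                           AND p3.PROPERTY_KEY = 'custom_property_3'
WHERE  p3.PROPERTY_VALUE NOT IN ('kylie', 'pussycat dolls', 'giraffe');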
As for maintainability - yuck. Your database access code now needs to be far smarter, knowing which table holds which property and how many columns are too many. The likelihood of introducing bugs is high - there are more moving parts, and I can't think of any obvious unit tests to make sure the database access logic is working.
Intelligibility is another concern - this solution is not in most developers' toolbox, it's not an industry-standard pattern. The old solution is pretty widely known - commonly referred to as "entity-attribute-value". This becomes a major issue on long-lived projects where you can't guarantee that the original development team will hang around.

How to represent a tree-like structure in a db

I'm starting a project and I'm in the design phase: i.e., I haven't yet decided which db framework I'm going to use. I'm going to have code that creates a "forest"-like structure; that is, many trees, where each tree is standard: nodes and edges. After the code creates these trees I want to save them in the db (and eventually pull them back out).
The naive approach to representing the data in the db is a relational db with two tables: nodes and edges. That is, the nodes table will have a node id, node data, etc., and the edges table will be a mapping of node id to node id.
Is there a better approach? Or, given the (limited) assumptions I've stated, is this the best approach? What if we add the assumption that the trees are relatively small - is it better then to save the whole tree as a blob in the db? Which type of db should I use in that case? Please comment on speed/scalability.
Thanks
I showed a solution similar to your nodes & edges tables, in my answer to the StackOverflow question: What is the most efficient/elegant way to parse a flat table into a tree? I call this solution "Closure Table".
I did a presentation on different methods of storing and using trees in SQL, Models for Hierarchical Data with SQL and PHP. I demonstrated that with the right indexes (depending on the queries you need to run), the Closure Table design can have very good performance, even over large collections of edges (about 500K edges in my demo).
I also covered the design in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
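For readers who haven't met it, a minimal sketch of the Closure Table idea (names illustrative): keep the plain nodes table, and store one row for every ancestor/descendant pair, including each node paired with itself.

CREATE TABLE nodes (
    node_id  INTEGER PRIMARY KEY,
    data     TEXT
);

CREATE TABLE tree_paths (
    ancestor    INTEGER NOT NULL REFERENCES nodes (node_id),
    descendant  INTEGER NOT NULL REFERENCES nodes (node_id),
    depth       INTEGER NOT NULL,  -- 0 for the self-pair
    PRIMARY KEY (ancestor, descendant)
);

-- The whole subtree under node 4 is then a single indexed join:
SELECT n.*
FROM   nodes n
JOIN   tree_paths t ON t.descendant = n.node_id
WHERE  t.ancestor = 4;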
Be sure to use some sort of low-level coding for the entity being treed, to prevent loops. The entity might be a part, subject, folder, etc.
With an Entity file and an Entity-Xref file you can work with, say, two relationships between the two files: a parent and a child relation.
A level is the level at which an entity is found in a tree. A low-level code for the entity is the lowest level at which that entity is found in any tree anywhere. Before making an entity a child, check that its low-level code is less than or equal to that of the prospective parent, to prevent a loop; after being added as a child, it will sit at least one level lower.

Hierarchical Data Models: Adjacency List vs. Nested Sets

I have a product catalog. Each category consists of a different number (depth) of subcategories. The number of levels (the depth) is unknown, but I am quite sure it will not exceed 5 or 6 levels. The data changes much more rarely than it is read.
The question is: what type of hierarchical data model is more suitable for this situation? The project is based on the Django framework, and its peculiarities (admin interface, model handling...) should be considered.
Many thanks!
Nested sets are better for performance if you don't need frequent updates or hierarchical ordering.
If you need either tree updates or hierarchical ordering, it's better to use the parent-child (adjacency list) model.
It's easily queried in Oracle and SQL Server 2005+, and not so easily (but still possible) in MySQL.
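A minimal sketch of the kind of hierarchical query meant here, using the WITH RECURSIVE syntax of PostgreSQL and MySQL 8+ (SQL Server spells it plain WITH, and Oracle also offers CONNECT BY; table and column names are illustrative):

-- All descendants of category 1, walking parent_id links level by level.
WITH RECURSIVE subtree AS (
    SELECT category_id, parent_id, name
    FROM   categories
    WHERE  category_id = 1
    UNION ALL
    SELECT c.category_id, c.parent_id, c.name
    FROM   categories c
    JOIN   subtree s ON c.parent_id = s.category_id
)
SELECT * FROM subtree;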
I would use the Modified Preorder Tree Traversal algorithm, MPTT, for this sort of hierarchical data. This allows great performance on traversing the tree and finding children, if you don't mind a bit of a penalty on changes to the structure.
Luckily Django has a great library available for this, django-mptt. I've used this in a number of projects with a lot of success. There's also django-treebeard which offers several alternative algorithms, but I haven't used it (and it doesn't seem as popular as mptt anyway).
According to these articles:
http://explainextended.com/2009/09/24/adjacency-list-vs-nested-sets-postgresql/
http://explainextended.com/2009/09/29/adjacency-list-vs-nested-sets-mysql/
"MySQL is the only system of the big four (MySQL, Oracle, SQL Server, PostgreSQL) for which the nested sets model shows decent performance and can be considered to stored hierarchical data."
http://www.sqlsummit.com/AdjacencyList.htm
The Adjacency List is much easier to maintain and Nested Sets are a lot faster to query.
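For context, the classic nested-sets read that makes querying so fast (the lft/rght column names are the conventional ones, illustrative here): a whole subtree is one range predicate, because every descendant's boundaries fall inside its ancestor's.

-- All descendants of node 42 in a single range scan.
SELECT child.*
FROM   tree AS parent
JOIN   tree AS child ON child.lft BETWEEN parent.lft AND parent.rght
WHERE  parent.node_id = 42;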
The problem has always been that converting an Adjacency List to Nested Sets has taken way too long, thanks to a really nasty "push stack" method that's loaded with RBAR (row-by-agonizing-row processing). So people end up doing some really difficult maintenance in Nested Sets or not using them.
Now, you can have your cake and eat it, too! You can do the conversion on 100,000 nodes in less than 4 seconds and on a million rows in less than a minute! All in T-SQL, by the way! Please see the following articles.
Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets
Hierarchies on Steroids #2: A Replacement for Nested Sets Calculations
