Unlimited levels of hierarchy in SQL table - PostgreSQL - database

I am looking for a way to store and handle an unlimited number of hierarchy levels for various organisations/entities stored in my DB. For example, instead of having just one parent and one child organisation (i.e. 2 levels of hierarchy) and just the one-to-many relationship allowed by a self-join (i.e. another column called parent referring to IDs of the same table), I want to be able to have as many levels of hierarchy and as many connections as possible.
Supposing I have an organisation table such as the following:
ID  Name   Other Non-related data
1   Test1  NULL
2   Test2  NULL
3   Test3  something
4   Test4  something else
5   Test5  etc
I am considering the following solution: for each table that needs this, I can add another table named originalTable_hierarchy whose two columns both refer to the organisation table, like this:
ID  ParentID  ChildID
1   1         2
2   2         4
3   3         1
4   3         2
5   2         3
From this table I can tell that 1 is parent to 2, 2 is parent to 4, 3 is parent to 1, 3 is also parent to 2, 2 is also parent to 3.
The restrictions I can think of are not to have the same ParentID and ChildID (e.g. a tuple like (3,3)) and not to have a record that puts them into the opposite order (e.g. if I have the (2,3) tuple, I can't also have (3,2))
Is this the correct solution for the multiple organisations and suborganisations I might have later on? Users will have to navigate through them easily back and forth. If users decide to split one organisation into many, does this solution suffice? What else should I consider (extra or missing perks) when doing this instead of a traditional self-join or a fixed number of tables for a fixed number of hierarchy levels (e.g. an organisation table and a suborganisation table)? Also, can I impose restrictions on certain records, so that no more children of a certain parent can be created? Or report on all the children of an original parent?
Please feel free to also instruct on where to read more about this. Any relevant resources are welcome.

You only need a single table: a row that references one parent allows a (theoretically) unlimited number of levels in the hierarchy. You do this by reversing the relationship so that the child references the parent (your table has the parent referencing the child). As a result a child, at any level, can also be a parent, and this can be chained as far as needed.
create table organization
  ( id        integer primary key
  , name      text
  , parent_id integer references organization(id)
  , constraint parent_not_self check (parent_id <> id)
  );

create unique index organization_not__mirrored
    on organization( least(id,parent_id), greatest(id,parent_id) );
The check constraint enforces your first restriction and the unique index the second.
The following query shows the full hierarchy, along with the full path and the level.
with recursive hier(id, parent_id, path, level) as
  ( -- anchor: top-level organizations have no parent
    select id, parent_id, id::text, 1
      from organization
     where parent_id is null
    union all
    -- recursive step: attach each child to its already-visited parent
    select o.id, o.parent_id, h.path || '->' || o.id::text, h.level + 1
      from organization o
      join hier h
        on (o.parent_id = h.id)
  )
select * from hier;
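To try the "report on all the children of an original parent" part without a PostgreSQL server, here is a minimal sketch that runs the same kind of recursive query through Python's built-in sqlite3 module. It is an adaptation, not the answer's exact code: SQLite is swapped in for PostgreSQL, the sample rows are an assumption loosely based on the question's data (3 is parent of 1, 1 of 2, 2 of 4), and the descendants helper is a hypothetical name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table organization (
        id        integer primary key,
        name      text,
        parent_id integer references organization(id),
        check (parent_id <> id)
    );
    -- Assumed sample data: each organization has at most one parent.
    insert into organization (id, name, parent_id) values
        (3, 'Test3', null),
        (1, 'Test1', 3),
        (2, 'Test2', 1),
        (4, 'Test4', 2),
        (5, 'Test5', null);
""")


def descendants(conn, root_id):
    """Return (id, path, level) for every organization below root_id."""
    return conn.execute(
        """
        with recursive hier(id, path, level) as (
            select id, cast(id as text), 1
              from organization
             where id = ?
            union all
            select o.id, h.path || '->' || o.id, h.level + 1
              from organization o
              join hier h on o.parent_id = h.id
        )
        select id, path, level
          from hier
         where id <> ?
         order by level, id
        """,
        (root_id, root_id),
    ).fetchall()


rows = descendants(conn, 3)
for row in rows:
    print(row)
```

Starting the anchor at a chosen id instead of at parent_id is null is the only real change: it limits the walk to one subtree.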

Related

Polymorphic Hierarchy DB Design

I'm looking for an efficient way to build a database model that can handle the following scenario:
The model needs to handle a hierarchy of undefined depth where each node can have 0 to 3 (at most) children where each child can be of one of 5-7 different types.
Below is an example of a sample tree that I need to support where 'foo', 'bar' etc. each would be entry in a different table and the number references the id from that table. The assumption is that foo.1, bar.2, and foo.3 are the top level nodes. Base is just a dummy object for me to point to the top level node of an unknown type.
Base.1
|-foo.1
|-foo.2
|-qaz.1
|-bar.1
Base.2
|-bar.2
|-qaz.2
|-foo.2
|-bar.3
Base.3
|-foo.3
Additionally, the order of the children must be maintained (i.e. it is not ok to switch qaz.2 with foo.2 in the above hierarchy).
From a data access perspective, most of the time I will be retrieving the entire tree inheriting from a top level object (Base.x).
The one thought I've had so far is to define a table with polymorphic association and each of the child node types reference that central table, such as:
BaseTable
---------
BaseId (PK)
ObjectId (FK)

ObjectTable
-----------
ObjectId (PK)
ObjectType
OtherTableId
BaseId (FK) gets top level parent

FooTable
---------
FooId (PK)
Child1_ObjectId (FK)
Child2_ObjectId (FK)
Child3_ObjectId (FK)
Other data fields...

etc. for bar and qaz
My idea was that I could use the BaseID FK in the ObjectTable to grab the entire tree, but is there an efficient way to reconstruct the whole tree through SQL, or would I need to do that in my code after retrieval? Or, is there a better way to store this kind of data in a way that is more efficient and guarantees relational integrity?
"each would be entry in a different table"
Is there a good reason for that?
Single Table Inheritance is the fastest, so if performance is your concern...
This will work:
create table foo (
    foo_id        int primary key,
    type          char(1) not null,
    parent_id     int null references foo(foo_id),
    display_order int not null default 0,
    foo_specific_data...,
    bar_specific_data...,
    qux_specific_data...,
    unique( parent_id, display_order )
);
Then use recursive common table expressions to query (avoid MySQL/MariaDB). Use a trigger to enforce the "max 3 children" rule and to handle display order updates.
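A minimal sketch of that single-table approach, run through Python's sqlite3 module so it can be tried locally. Everything specific here is an assumption for illustration: the table is called node rather than foo, the type-specific columns are left out, the sample tree is made up (the nesting in the question did not survive), and the trigger only covers inserts, not updates that move a node. The sort key assumes non-negative display_order values below 10000.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table node (
        node_id       integer primary key,
        type          text    not null check (type in ('foo', 'bar', 'qaz')),
        parent_id     integer references node(node_id),
        display_order integer not null default 0,
        unique (parent_id, display_order)
    );

    -- Reject the insert when the parent already has 3 children.
    create trigger node_max_3_children
    before insert on node
    begin
        select raise(abort, 'a node can have at most 3 children')
         where (select count(*) from node
                 where parent_id = new.parent_id) >= 3;
    end;
""")


def add_child(conn, node_id, node_type, parent_id, display_order):
    """Insert a node; return False if a constraint or the trigger rejects it."""
    try:
        conn.execute(
            "insert into node (node_id, type, parent_id, display_order) "
            "values (?, ?, ?, ?)",
            (node_id, node_type, parent_id, display_order),
        )
        return True
    except sqlite3.DatabaseError:
        return False


def load_tree(conn, root_id):
    """Return (node_id, type, depth) rows depth-first, siblings in display order."""
    return conn.execute(
        """
        with recursive tree(node_id, type, depth, sort_key) as (
            select node_id, type, 0, ''
              from node
             where node_id = ?
            union all
            select n.node_id, n.type, t.depth + 1,
                   t.sort_key || '/' || printf('%04d', n.display_order)
              from node n
              join tree t on n.parent_id = t.node_id
        )
        select node_id, type, depth from tree order by sort_key
        """,
        (root_id,),
    ).fetchall()


# Hypothetical tree: 1 (foo) has children 2 (foo) and 3 (bar); 2 has child 4 (qaz).
add_child(conn, 1, "foo", None, 0)
add_child(conn, 2, "foo", 1, 0)
add_child(conn, 3, "bar", 1, 1)
add_child(conn, 4, "qaz", 2, 0)

tree = load_tree(conn, 1)
for node_id, node_type, depth in tree:
    print("  " * depth + f"{node_type}.{node_id}")
```

The whole tree comes back from one query in display order, so the application only has to nest rows by depth rather than issue one query per level.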

One Child Table For Multiple Parents Or Multiple Child Table For Multiple Parents

I am a bit confused while designing my table structure for managing addresses and contacts for multiple entities.
In my case I have four types of entity, each of which can have multiple addresses and contacts.
I have created 4 parent tables for the 4 entities.
But while creating the address and contact tables, I am thinking of creating just one address table and one contact table for all parent entities and linking them through an Entity Key, something like below.
Id  ContactType  ParentId  EntityKey  Value
So is it a good idea, or should I create 4 individual child tables, one for each type of entity? All my child tables would have the same structure.
Please advise. I am not sure which design to follow.
I created all my parent tables and one child table each for address and contact. I didn't create any direct relation between the parent and child entities. Instead I made the relation through one more table that cross-references all parent entities and the child entity. I added a check constraint that ensures exactly one parent entity ID is filled and the other parent entity ID columns are null. I also added a unique index to avoid duplicate entries. In this way I can also maintain referential integrity.
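The cross-reference table described above could look roughly like the following sketch, run through Python's sqlite3 module. The names are all hypothetical (customer, supplier, address, address_link), only two of the four parent entities are shown for brevity, and the unique index on address_id is one guess at what "avoid duplicate entries" meant.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("pragma foreign_keys = on")  # SQLite enforces FKs only when asked
conn.executescript("""
    create table customer (customer_id integer primary key, name  text);
    create table supplier (supplier_id integer primary key, name  text);
    create table address  (address_id  integer primary key, value text);

    create table address_link (
        address_id  integer not null references address(address_id),
        customer_id integer references customer(customer_id),
        supplier_id integer references supplier(supplier_id),
        -- exactly one parent column may be filled
        check ((customer_id is not null) + (supplier_id is not null) = 1)
    );
    -- assumption: an address belongs to a single parent
    create unique index address_link_uq on address_link (address_id);

    insert into customer values (1, 'Customer 1');
    insert into supplier values (1, 'Supplier 1');
    insert into address  values (1, 'First Street'), (2, 'Second Street');
""")


def link(conn, address_id, customer_id=None, supplier_id=None):
    """Attach an address to a parent; return False if a constraint rejects it."""
    try:
        conn.execute(
            "insert into address_link (address_id, customer_id, supplier_id) "
            "values (?, ?, ?)",
            (address_id, customer_id, supplier_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


ok = link(conn, 1, customer_id=1)
two_parents = link(conn, 2, customer_id=1, supplier_id=1)
no_parent = link(conn, 2)
print(ok, two_parents, no_parent)
```

With four parent entities the check simply sums four "is not null" terms. Because every parent column carries a real foreign key, the database can enforce referential integrity, which a generic ParentId plus EntityKey pair cannot do.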
Case 1: The tables will receive millions of rows in a short time, the query is used frequently and in many places, and the structure of any parent or child table may change. Then use
multiple parent and multiple child tables.
Reason: When a table grows that rapidly, you end up partitioning it. So why not keep separate tables from the beginning? You will also have one less WHERE condition.
Case 2: The table grows at a normal rate, the query on it is used in very few places, and concurrency is low. Then you can also have
one parent and one child table.
------------ According to the URL reference ------------
I am saying exactly the same thing, where
Entity = the Address table with its other details plus AddressTypeID and ParentID.
I am only saying there is no need for further normalisation, such as creating an AddressType table and referencing it through AddressTypeID, since there can be at most 4-5 address types. On insert you can pass a hard-coded value like 1 or 2.
Say I want to join the Address table with Entity1; then my query is
From Address A inner join Entity1 E on A.ParentID = E.EntityID1
where A.AddressTypeID = 1
similarly
From Address A inner join Entity2 E on A.ParentID = E.EntityID2
where A.AddressTypeID = 2
and so on.
If, in another example, AddressType can have any number of values,
then you are bound to create a separate table as in that URL.

Materialize a CTE, or otherwise increase performance

Given a table (AccountId, ParentId NULL), we want to be able to quickly find:
1. The master parent ID (the accountId where ParentId is null).
2. All children for a given account ID.
With a CTE this is fairly easy. However, we can't save the CTE in an indexed view, which hurts performance. We've kicked around some other ideas, like saving the path (id1/id2/id3) in another field, but that feels sorta hacky.
We thought of a trigger that'd save the "master" ID on each row, but we're unsure how that'd work in the middle of a chain (1 owns 2 owns 3, but then 2 transfers to 7). It also doesn't solve the "find all children" query.
Any thoughts? We're using SQL 2008 R2, but can move to SQL 2012.
In SQL 2008, there is a hierarchyid type that basically implements the saving the path to the root. http://technet.microsoft.com/en-us/library/bb677290%28v=sql.100%29.aspx
If your hierarchy is mostly static, another option is to have a denormalized version of this table that holds one row for every combination of a parent and each of its descendants. So if your hierarchy is A is a parent of B, who is a parent of C, the denormalized table can look like this:
parent child depth
A A 0
A B 1
A C 2
B B 0
B C 1
C C 0
Now if you index both the parent and the child columns, searching the hierarchy becomes very fast.
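As a rough illustration, both lookups from the question become single indexed queries against that denormalized table. This sketch uses Python's sqlite3 module with the A/B/C rows from the example; the table name account_closure and the two helper functions are hypothetical, and keeping the table in sync when an account moves is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table account_closure (
        parent text    not null,
        child  text    not null,
        depth  integer not null,
        primary key (parent, child)            -- index for "all children"
    );
    create index account_closure_child
        on account_closure (child, depth);     -- index for "master parent"

    insert into account_closure (parent, child, depth) values
        ('A', 'A', 0), ('A', 'B', 1), ('A', 'C', 2),
        ('B', 'B', 0), ('B', 'C', 1),
        ('C', 'C', 0);
""")


def master_parent(conn, account):
    """The ancestor farthest from the account is the top of its chain."""
    row = conn.execute(
        "select parent from account_closure "
        "where child = ? order by depth desc limit 1",
        (account,),
    ).fetchone()
    return row[0] if row else None


def all_children(conn, account):
    """Every descendant of the account, nearest first."""
    rows = conn.execute(
        "select child from account_closure "
        "where parent = ? and depth > 0 order by depth, child",
        (account,),
    )
    return [r[0] for r in rows]


print(master_parent(conn, "C"), all_children(conn, "A"))
```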

Best way to create a unique number for each many to many relationship

I have a table of Students and a table of Courses that are connected through an intermediate table to create a many-to-many relationship (ie. a student can enroll in multiple courses and a course can have multiple students). The problem is that the client wants a unique student ID per course. For example:
rowid  Course  Student  ID (calculated)
1      A       Ben      1
2      A       Alex     2
3      A       Luis     3
4      B       Alex     1
5      B       Gail     2
6      B       Steve    3
The ID's should be numbered from 1 and a student can have a different ID for different course (Alex for example has ID=2 for course A, but ID=1 for Course B). Once an ID is assigned it is fixed and cannot change. I implemented a solution by ordering on the rowid of the through table "SELECT Student from table WHERE Course=A ORDER BY rowid" and then returning a number based on the order of the results.
The problem with this solution, is that if a student leaves a course (is deleted from the table), the numbers of the other students will change. Can someone recommend a better way? If it matters, I'm using PostgreSQL and Django. Here's what I've thought of:
1. Creating a column for the ID instead of calculating it. When a new relationship is created, assigning an ID based on the max(id)+1 of the students in the course.
2. Adding a column "disabled" and setting it True when a student leaves the course. This would involve changing all my code to make sure that only active students are used.
I think the first solution is better, but is there a more "database centric way" where the database can calculate this for me automatically?
If you want to have stable IDs, you certainly need to store them in the table.
You'll need to assign a new sequential ID for every student that joins a course and just delete it if the student leaves, without touching the others.
If you have concurrent access to your tables, don't use MAX(id), as two queries can select the same MAX(id) before either of them inserts it into the table.
Instead, create a separate table to be used as a sequence, lock each course's row with SELECT FOR UPDATE, then insert the new student's ID and update the row with a new ID in a single transaction, like this:
Courses:
Name NextID
------- ---------
Math 101
Physics 201
Attendants:
Student Course Id
------- ------ ----
Smith Math 99
Jones Math 100
Smith Physics 200
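BEGIN TRANSACTION;

-- Lock the course's row and read the next free ID.
-- #NewID stands for a local variable; the exact syntax depends on the DBMS.
SELECT  NextID
INTO    #NewID
FROM    Courses
WHERE   Name = 'Math'
FOR UPDATE;

INSERT
INTO    Attendants (Student, Course, Id)
VALUES  ('Doe', 'Math', #NewID);

-- Courses is keyed by Name (see the table above), not Course.
UPDATE  Courses
SET     NextID = #NewID + 1
WHERE   Name = 'Math';

COMMIT;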
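The same per-course counter can be tried locally with Python's sqlite3 module. This is a sketch under stated assumptions rather than the PostgreSQL version: SQLite has no SELECT ... FOR UPDATE, so BEGIN IMMEDIATE (which takes the write lock up front) is swapped in to serialize concurrent enrollments, and the enroll helper and lower-case table names are hypothetical.

```python
import sqlite3

# isolation_level=None: the module stays out of the way and we
# issue BEGIN / COMMIT / ROLLBACK ourselves.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    create table courses (
        name    text primary key,
        next_id integer not null default 1
    );
    create table attendants (
        student text    not null,
        course  text    not null references courses(name),
        id      integer not null,
        primary key (course, id),
        unique (course, student)
    );
    insert into courses (name) values ('Math'), ('Physics');
""")


def enroll(conn, student, course):
    """Give the student the course's next ID and advance the counter atomically."""
    conn.execute("begin immediate")  # write lock, stands in for FOR UPDATE
    try:
        row = conn.execute(
            "select next_id from courses where name = ?", (course,)
        ).fetchone()
        if row is None:
            raise ValueError(f"unknown course: {course}")
        new_id = row[0]
        conn.execute(
            "insert into attendants (student, course, id) values (?, ?, ?)",
            (student, course, new_id),
        )
        conn.execute(
            "update courses set next_id = ? where name = ?",
            (new_id + 1, course),
        )
        conn.execute("commit")
        return new_id
    except Exception:
        conn.execute("rollback")
        raise


ids = [enroll(conn, "Ben", "Math"),
       enroll(conn, "Alex", "Math"),
       enroll(conn, "Alex", "Physics")]

# Ben leaves; nobody else's ID changes and his ID is never reused.
conn.execute("delete from attendants where student = 'Ben' and course = 'Math'")
luis_id = enroll(conn, "Luis", "Math")
print(ids, luis_id)
```

Because the counter only ever moves forward, a deleted student's number is never handed out again, which is what keeps the remaining IDs stable.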
Your first suggestion seems good: have a last_id field in the course table that you increase by 1 any time you enroll a student in that course.
"Creating a column for the ID instead of calculating it. When a new relationship is created assigning an ID based on the max(id)+1 of the students in the course"
That's how I'd do it. There is no point in calculating it. And the IDs shouldn't change just because someone dropped out.
"Adding a column "disabled" and setting it True when a student leaves the course."
Yes, that would be a good idea. Another one is creating another table of the same structure, where you'll store dropped students. Then of course you'll have to select max(id) from the union of these two tables.
I think there are two concepts that you need to help you out here.
1. Sequences, where the database gets the next value for an ID for you automatically.
2. Composite keys, where more than one column can be combined to make the primary key of a table.
From a quick google it looks like Django can handle sequences but not composite keys, so you will need to emulate that somehow. However you could equally have two foreign keys and a sequence for the course/student relationship
As for how to handle deletions, it depends on what you need from your app, you may find that a status field would help you as you may want to differentiate between students who left and those that were kicked out, or get statistics on how many students leave different courses.

Hierarchical Data Structure Design (Nested Sets)

I'm working on a design for a hierarchical database structure which models a catalogue containing products (this is similar to this question). The database platform is SQL Server 2005 and the catalogue is quite large (750,000 products, 8,500 catalogue sections over 4 levels) but is relatively static (reloaded once a day) and so we are only concerned about READ performance.
The general structure of the catalogue hierarchy is:-
Level 1 Section
  Level 2 Section
    Level 3 Section
      Level 4 Section (products are linked to here)
We are using the Nested Sets pattern for storing the hierarchy levels and storing the products which exist at that level in a separate linked table. So the simplified database structure would be
CREATE TABLE CatalogueSection
(
    SectionID   INTEGER,
    ParentID    INTEGER,
    LeftExtent  INTEGER,
    RightExtent INTEGER
)

CREATE TABLE CatalogueProduct
(
    ProductID INTEGER,
    SectionID INTEGER
)
We do have an added complication in that we have about 1000 separate customer groups which may or may not see all products in the catalogue. Because of this we need to maintain a separate "copy" of the catalogue hierarchy for each customer group so that when they browse the catalogue, they only see their products and they also don't see any sections which are empty.
To facilitate this we maintain a table of the number of products at each level of the hierarchy "rolled up" from the section below. So, even though products are only directly linked to the lowest level of the hierarchy, they are counted all the way up the tree. The structure of this table is
CREATE TABLE CatalogueSectionCount
(
    SectionID       INTEGER,
    CustomerGroupID INTEGER,
    SubSectionCount INTEGER,
    ProductCount    INTEGER
)
So, onto the problem
Performance is very poor at the top levels of the hierarchy. The general query to show the "top 10" products in the selected catalogue section (and all child sections) is taking somewhere in the region of 1 minute to complete. At lower sections in the hierarchy it is faster but still not good enough.
I've put indexes (including covering indexes where applicable) on all key tables, run it through the query analyzer, index tuning wizard etc but still cannot get it to perform fast enough.
I'm wondering whether the design is fundamentally flawed or whether it's because we have such a large dataset? We have a reasonable development server (3.8GHZ Xeon, 4GB RAM) but it's just not working :)
Thanks for any help
James
Use a closure table. If your basic structure is a parent-child with the fields ID and ParentID, then the structure for a closure table is ID and DescendantID. In other words, a closure table is an ancestor-descendant table, where each possible ancestor is associated with all descendants. You may include a LevelsBetween field if you need. Closure table implementations usually include self-referencing records, i.e. ID 1 is an ancestor of descendant ID 1 with LevelsBetween of zero.
Example:
Parent/Child
ParentID - ID
1 - 2
1 - 3
3 - 4
3 - 5
4 - 6

Ancestor/Descendant
ID - DescendantID - LevelsBetween
1 - 1 - 0
1 - 2 - 1
1 - 3 - 1
1 - 4 - 2
1 - 5 - 2
1 - 6 - 3
2 - 2 - 0
3 - 3 - 0
3 - 4 - 1
3 - 5 - 1
3 - 6 - 2
4 - 4 - 0
4 - 6 - 1
5 - 5 - 0
6 - 6 - 0
The table is intended to eliminate recursive joins. You push the load of the recursive join into an ETL cycle that you do when you load the data once a day. That shifts it away from the query.
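One way to do that once-a-day rebuild is a single recursive query that walks the parent/child table and writes out every ancestor/descendant pair. The sketch uses Python's sqlite3 module with the Parent/Child rows from the example; SQLite stands in for SQL Server here, and the names section, section_closure and rebuild_closure are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table section (
        id        integer primary key,
        parent_id integer references section(id)
    );
    insert into section (id, parent_id) values
        (1, null), (2, 1), (3, 1), (4, 3), (5, 3), (6, 4);

    create table section_closure (
        id             integer not null,
        descendant_id  integer not null,
        levels_between integer not null,
        primary key (id, descendant_id)
    );
""")


def rebuild_closure(conn):
    """Recompute the whole closure table from the parent/child table."""
    with conn:  # commit on success, roll back on error
        conn.execute("delete from section_closure")
        conn.execute("""
            with recursive walk(id, descendant_id, levels_between) as (
                -- every section is its own descendant at distance 0
                select id, id, 0 from section
                union all
                -- extend each known pair by one more level of children
                select w.id, s.id, w.levels_between + 1
                  from walk w
                  join section s on s.parent_id = w.descendant_id
            )
            insert into section_closure (id, descendant_id, levels_between)
            select id, descendant_id, levels_between from walk
        """)


rebuild_closure(conn)
closure = conn.execute(
    "select id, descendant_id, levels_between "
    "from section_closure order by id, descendant_id"
).fetchall()

# All sections under section 3, with no recursion at query time.
under_3 = [r[0] for r in conn.execute(
    "select descendant_id from section_closure "
    "where id = 3 and levels_between > 0 order by descendant_id")]
print(len(closure), under_3)
```

The recursion runs once per load; every read afterwards is a plain seek on (id, descendant_id).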
Also, it allows variable-level hierarchies. You won't be stuck at 4.
Finally, it allows you to slot products in non-leaf nodes. A lot of catalogs create "Miscellaneous" buckets at higher levels of the hierarchy to create a leaf-node to attach products to. You don't need to do that since intermediate nodes are included in the closure.
As far as indexing goes, I would do a clustered index on ID/DescendantID.
Now for your query performance. This takes a chunk out of the problem, but not all of it. You mentioned a "Top 10". This implies ranking over a set of facts that you haven't mentioned. We need details to help tune those. Plus, this only gets the leaf-level sections, not the products. At the very least, you should have an index on your CatalogueProduct that orders by SectionID/ProductID. I would force Section to Product joins to be loop joins based on the cardinality you provided. A report on a catalog section would go to the closure table to get descendants (using a clustered index seek). That list of descendants would then be used to get products from CatalogueProduct using the index by looped index seeks. Then, with those products, you would get the facts necessary to do the ranking.
You might be able to solve the customer groups problem with roles and tree IDs, but you'll have to provide us with the query.
Might it be possible to calculate the ProductCount and SubSectionCount after the load each day?
If the data is changing only once a day surely it's worthwhile to calculate these figures then, even if some denormalization is required.
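A minimal sketch of such a post-load step, assuming an ancestor/descendant (closure) table that includes self rows is available. It uses Python's sqlite3 module; the table and function names are hypothetical, the sample data is made up, and the per-customer-group dimension is left out because the question does not show how product visibility is stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table section_closure (
        id            integer not null,
        descendant_id integer not null,
        primary key (id, descendant_id)
    );
    create table catalogue_product (
        product_id integer primary key,
        section_id integer not null
    );
    create table section_product_count (
        section_id    integer primary key,
        product_count integer not null
    );

    -- Made-up data: sections 1 -> 2 -> 3 in a chain (self rows included),
    -- two products in section 3 and one in section 2.
    insert into section_closure values
        (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3);
    insert into catalogue_product values (10, 3), (11, 3), (12, 2);
""")


def refresh_counts(conn):
    """Recompute the rolled-up product count for every section after a load."""
    with conn:
        conn.execute("delete from section_product_count")
        conn.execute("""
            insert into section_product_count (section_id, product_count)
            select c.id, count(p.product_id)
              from section_closure c
              left join catalogue_product p
                     on p.section_id = c.descendant_id
             group by c.id
        """)


refresh_counts(conn)
counts = dict(conn.execute(
    "select section_id, product_count from section_product_count"))
print(counts)
```

Sections whose count comes out as 0 are the empty ones that should be hidden when browsing.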
