Polymorphic Hierarchy DB Design


I'm looking for an efficient way to build a database model that can handle the following scenario:
The model needs to handle a hierarchy of undefined depth in which each node can have 0 to 3 children, and each child can be one of 5-7 different types.
Below is a sample tree that I need to support, where 'foo', 'bar', etc. would each be an entry in a different table and the number references the id from that table. The assumption is that foo.1, bar.2, and foo.3 are the top-level nodes. Base is just a dummy object that lets me point to a top-level node of unknown type.
Base.1
|-foo.1
|-foo.2
|-qaz.1
|-bar.1
Base.2
|-bar.2
|-qaz.2
|-foo.2
|-bar.3
Base.3
|-foo.3
Additionally, the order of the children must be maintained (i.e. it is not ok to switch qaz.2 with foo.2 in the above hierarchy).
From a data access perspective, most of the time I will be retrieving the entire tree descending from a top-level object (Base.x).
The one thought I've had so far is to define a table with polymorphic association and each of the child node types reference that central table, such as:
BaseTable
---------
BaseId (PK)
ObjectId (FK)
ObjectTable
-----------
ObjectId (PK)
ObjectType
OtherTableId
BaseId (FK) gets top level parent
FooTable
---------
FooId (PK)
Child1_ObjectId (FK)
Child2_ObjectId (FK)
Child3_ObjectID (FK)
Other data fields...
etc. for bar and qaz
My idea was that I could use the BaseID FK in the ObjectTable to grab the entire tree, but is there an efficient way to reconstruct the whole tree through SQL, or would I need to do that in my code after retrieval? Or, is there a better way to store this kind of data in a way that is more efficient and guarantees relational integrity?

"each would be an entry in a different table"
Is there a good reason for that?
Single Table Inheritance is the fastest, so if performance is your concern...
This will work:
create table foo (
foo_id int primary key,
type char(1) not null,
parent_id int null references foo(foo_id),
display_order int not null default 0,
foo_specific_data...,
bar_specific_data...,
qux_specific_data...,
unique( parent_id, display_order )
);
Then use recursive common table expressions to query (MySQL before 8.0 and MariaDB before 10.2 do not support them). Use a trigger to enforce the "max 3 children" rule and to handle display-order updates.
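As a rough illustration of the recursive CTE approach, here is a minimal sketch that uses Python's built-in sqlite3 module instead of the asker's RDBMS. The table name node, the sample rows, and the helper fetch_tree are assumptions made for the example; the type-specific data columns are left out. The zero-padded path column is just one way to get rows back in display order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    node_id       INTEGER PRIMARY KEY,
    type          TEXT NOT NULL,                      -- 'foo', 'bar', 'qaz', ...
    parent_id     INTEGER REFERENCES node(node_id),   -- NULL for a top-level node
    display_order INTEGER NOT NULL DEFAULT 0,
    UNIQUE (parent_id, display_order)
);
-- Hypothetical sample data: one top-level node with two ordered children
INSERT INTO node VALUES
    (1, 'bar', NULL, 0),
    (2, 'qaz', 1, 0),
    (3, 'foo', 1, 1),
    (4, 'bar', 3, 0);
""")

def fetch_tree(conn, root_id):
    """Return (depth, node_id, type) rows for the subtree under root_id, in display order."""
    return conn.execute("""
        WITH RECURSIVE tree(node_id, type, depth, path) AS (
            SELECT node_id, type, 0, printf('%04d', display_order)
            FROM node WHERE node_id = ?
            UNION ALL
            SELECT n.node_id, n.type, t.depth + 1,
                   t.path || '.' || printf('%04d', n.display_order)
            FROM node AS n JOIN tree AS t ON n.parent_id = t.node_id
        )
        SELECT depth, node_id, type FROM tree ORDER BY path
    """, (root_id,)).fetchall()

rows = fetch_tree(conn, 1)
```

Sorting on the accumulated path keeps siblings in display_order and places each subtree directly under its parent, so the application can rebuild the tree in a single pass over the rows.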

Related

Database table circular reference between master and child table

I have two tables
person
person_photos
with a one-to-many relationship (i.e. each person can have a list of photos)
e.g.
person {
person_id number, <<THIS IS PK>>
person_name varchar,
other_columns...
}
person_photos {
person_photo_id number,<<THIS IS PK>>
person_id number, <<THIS IS FK>>
photo blob
}
I want one of the photos marked as the default. Is it OK to have a reference to the default photo in the master table,
i.e.
person {
person_id number,<<THIS IS PK>>
person_name varchar,
other_columns...
default_person_photo_id number <<Reference to child table>>
}
This basically creates a circular reference between the two tables.
Is there any issue with this approach?
Or any other better way of doing it?
Note:
I can introduce a column in the person_photos table to mark which photo is the default; however, I am primarily introducing this default photo id in the master table to avoid having to join the photo table to get that information.
I can also create a mapping table, but I would like to go with that approach only if there is an issue with the circular design.

Part of this depends on which RDBMS you are using. If you are using one without partial unique indexes (MySQL, for example), then that is probably the best way you can do it.
On the other hand if you can have partial unique indexes then you can do as follows:
Remove person.default_person_photo_id
Add a boolean person_photos.is_default field
CREATE UNIQUE INDEX default_person_photos_idx ON person_photos(person_id) WHERE is_default
Then you can never have more than one, and if you search for a photo based on a person_id where is_default then the index can be used, possibly saving you a join.
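To make the effect of the partial unique index concrete, here is a small sketch run against SQLite (which also supports partial indexes) through Python's sqlite3 module. The sample rows and the helper try_second_default are assumptions for illustration, and is_default is stored as 0/1 because SQLite has no dedicated boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person_photos (
    person_photo_id INTEGER PRIMARY KEY,
    person_id       INTEGER NOT NULL,
    is_default      INTEGER NOT NULL DEFAULT 0   -- 0/1 stands in for a boolean
);
-- Only rows with is_default = 1 are indexed, so at most one per person_id
CREATE UNIQUE INDEX default_person_photos_idx
    ON person_photos(person_id) WHERE is_default = 1;
-- Hypothetical sample data: person 10 has two photos, one of them the default
INSERT INTO person_photos VALUES (1, 10, 1), (2, 10, 0), (3, 11, 1);
""")

def try_second_default(conn):
    """Try to give person 10 a second default photo; return True if accepted."""
    try:
        conn.execute("INSERT INTO person_photos VALUES (4, 10, 1)")
        return True
    except sqlite3.IntegrityError:
        return False

second_default_accepted = try_second_default(conn)
default_photo = conn.execute(
    "SELECT person_photo_id FROM person_photos WHERE person_id = ? AND is_default = 1",
    (10,),
).fetchone()[0]
```

Non-default photos are unrestricted because they never enter the index; only a second is_default row for the same person is rejected.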
So, in answer to your question: without knowing your RDBMS's capabilities, I can't say that there is a better way, and you certainly aren't doing anything wrong. But for some RDBMSs there is a better way.

Postgresql inheritance based database design

I'm developing a simple babysitter application that has 2 types of users: a 'Parent' and the 'Babysitter'. I'm using postgresql as my database but I'm having trouble working out my database design.
The 'Parent' and the 'Babysitter' entities have attributes that can be generalized, for example: username, password, email, ... Those attributes could be placed into a parent entity called 'User'. They both also have their own attributes, for example: Babysitter -> age.
In terms of OOP things are very clear to me: just extend the user class and you are good to go. In DB design, though, things are different.
Before posting this question I roamed around the internet for a good week looking for insight into this 'issue'. I did find a lot of information, but it seemed to me that there was a lot of disagreement. Here are some of the posts I've read:
How do you effectively model inheritance in a database?: Table-Per-Type (TPT), Table-Per-Hierarchy (TPH) and Table-Per-Concrete (TPC) VS 'Forcing the RDb into a class-based requirements is simply incorrect.'
https://dba.stackexchange.com/questions/75792/multiple-user-types-db-design-advice:
Table: `users`; contains all similar fields as well as a `user_type_id` column (a foreign key on `id` in `user_types`)
Table: `user_types`; contains an `id` and a `type` (Student, Instructor, etc.)
Table: `students`; contains fields only related to students as well as a `user_id` column (a foreign key of `id` on `users`)
Table: `instructors`; contains fields only related to instructors as well as a `user_id` column (a foreign key of `id` on `users`)
etc. for all `user_types`
https://dba.stackexchange.com/questions/36573/how-to-model-inheritance-of-two-tables-mysql/36577#36577
When to use inherited tables in PostgreSQL?: Inheritance in postgresql does not work as expected for me and a bunch of other users as the original poster points out.
I am really confused about which approach I should take. Class-table-inheritance (https://stackoverflow.com/tags/class-table-inheritance/info) seems like the most correct in my OOP mindset, but I would very much appreciate an updated, DB-minded opinion.

The way that I think of inheritance in the database world is "can only be one kind of." No other relational modeling technique works for that specific case; even with check constraints, with a strict relational model, you have the problem of putting the wrong "kind of" person into the wrong table. So, in your example, a user can be a parent or a babysitter, but not both. If a user can be more than one kind-of user, then inheritance is not the best tool to use.
The instructor/student relationship really only works well in the case where students cannot be instructors or vice-versa. If you have a TA, for example, it's better to model that using a strict relational design.
So, back to the parent-babysitter, your table design might look like this:
-- user is a reserved word in PostgreSQL, so the table name must be quoted
CREATE TABLE "user" (
    id SERIAL,
    full_name TEXT,
    email TEXT,
    phone_number TEXT
);
CREATE TABLE parent (
    preferred_payment_method TEXT,
    alternate_contact_info TEXT,
    PRIMARY KEY(id)
) INHERITS("user");
CREATE TABLE babysitter (
    age INT,
    min_child_age INT,
    preferred_payment_method TEXT,
    PRIMARY KEY(id)
) INHERITS("user");
-- many-to-many relationship between parents and babysitters
CREATE TABLE parent_babysitter (
    parent_id INT REFERENCES parent(id),
    babysitter_id INT REFERENCES babysitter(id),
    PRIMARY KEY(parent_id, babysitter_id)
);
This model allows users to be "only one kind of" user: a parent or a babysitter. Notice how the primary key definitions are left to the child tables. In this model, you can have duplicate IDs between parent and babysitter, though this may not be a problem depending on how you write your code. (Note: Postgres is the only ORDBMS I know of with this restriction. Informix and Oracle, for example, have inherited keys on inherited tables.)
Also see how we mixed the relational model in - we have a many-to-many relationship between parents and babysitters. That way we keep the entities separated, but we can still model a relationship without weird self-referencing keys.

All the options can be roughly represented by following cases:
(1) base table + table for each class (class-table inheritance, Table-Per-Type, suggestions from the dba.stackexchange)
(2) single table inheritance (Table-Per-Hierarchy) - just put everything into the single table
(3) create independent tables for each class (Table-Per-Concrete)
I usually prefer option (1), because (2) and (3) are not completely correct in terms of DB design.
With (2) you will have unused columns for some rows (like "age" will be empty for Parent). And with (3) you may have duplicated data.
But you also need to think in terms of data access. With option (1) you will have the data spread over a few tables, so to get a Parent you will need to use join operations to select data from both the User and Parent tables.
I think that's the reason why options (2) and (3) exist - they are easier to use in terms of SQL queries (no joins are needed, you just select the data you need from one table).
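A minimal sketch of option (1) and the join it requires, run against SQLite through Python's sqlite3 module rather than PostgreSQL. The table names users, babysitters, and parents, the columns, and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
-- Base table: attributes shared by every kind of user
CREATE TABLE users (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
-- One table per subtype, sharing the base table's key
CREATE TABLE babysitters (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    age     INTEGER NOT NULL
);
CREATE TABLE parents (
    user_id INTEGER PRIMARY KEY REFERENCES users(id),
    preferred_payment_method TEXT
);
INSERT INTO users VALUES (1, 'sitter@example.com'), (2, 'parent@example.com');
INSERT INTO babysitters VALUES (1, 19);
INSERT INTO parents VALUES (2, 'cash');
""")

# Reading a full babysitter means joining the base table to the subtype table
babysitter = conn.execute("""
    SELECT u.id, u.email, b.age
    FROM users AS u JOIN babysitters AS b ON b.user_id = u.id
    WHERE u.id = ?
""", (1,)).fetchone()
```

There are no unused columns and no duplicated data, at the price of one join per subtype read. Note that nothing in this sketch stops the same user id from appearing in both subtype tables.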

What is the best way to keep this schema clear?

Currently I'm working on an RFID project where each tag is attached to an object. An object could be a person, a computer, a pencil, a box or whatever comes to my boss's mind.
And of course each object has different attributes.
So I'm trying to have a table, tags, where I keep a register of each tag in the system, plus other tables where I relate a tag to an object and describe its other attributes. This is what I have done (not the real schema, just a simplified version).
Then I realized that this schema could have the same tag in several tables.
For example, the tag 123 could be in C and B at the same time, which should be impossible because each tag can be attached to just a single object.
To put it simply, I want each tag to appear no more than once in the database.
My current approach: [diagram not shown]
What I really want: [diagram not shown]
Update:
Yeah, the TagID is chosen by the end user. Moreover the TagID is given by a Tag Reader and the TagID is a 128-bit number.
New Update:
The objects until now are:
-- Medicament(TagID, comercial_name, generic_name, amount, ...)
-- Machine(TagID, name, description, model, manufacturer, ...)
-- Patient(TagID, firstName, lastName, birthday, ...)
All the attributes (columns or whatever you name it) are very different.
Update after update
I'm working on a system with RFID tags for a hospital. Each RFID tag is attached to an object in order to keep track of it, and unfortunately each object has a lot of different attributes.
An object could be a person, a machine or a medicine, or maybe a new object with other attributes.
So I just want a flexible and clever schema that allows me to introduce new object types and also lets me easily add new attributes to an object, keeping in mind that this system could become very large.
Examples:
Tag(TagID)
Medicine(generic_name, comercial_name, expiration_date, dose, price, laboratory, ...)
Machine(model, name, description, price, buy_date, ...)
Patient(PatientID, first_name, last_name, birthday, ...)
Each tag must be related to exactly one object.
Note: I don't really speak (or write) English well, sorry for that :P Not a native speaker here.

You can enforce these rules using relational constraints. Note the use of a persisted computed column to enforce the constraint Tag:{Pencil or Computer}. This model gives you great flexibility to model each child table (Person, Machine, Pencil, etc.) and at the same time prevents any conflicts between tags. It is also good that we don't have to resort to triggers or UDFs via check constraints to enforce the relation; the relation is built into the model.
create table dbo.TagType (TagTypeID int primary key, TagTypeName varchar(10));
insert into dbo.TagType
values(1, 'Computer'), (2, 'Pencil');
create table dbo.Tag
( TagId int primary key,
TagTypeId int references TagType(TagTypeId),
TagName varchar(10),
TagDate datetime,
constraint UX_Tag unique (TagId, TagTypeId)
)
go
create table dbo.Computer
( TagId int primary key,
TagTypeID as 1 persisted,
CPUType varchar(25),
CPUSpeed varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeID)
)
go
create table dbo.Pencil
( TagId int primary key,
TagTypeId as 2 persisted,
isSharp bit,
Color varchar(25),
foreign key (TagId, TagTypeID) references Tag(TagId, TagTypeId)
)
go
-----------------------------------------------------------
-- create a new tag of type Pencil:
-----------------------------------------------------------
insert into dbo.Tag(TagId, TagTypeId, TagName, TagDate)
values(1, 2, 'Tag1', getdate());
insert into dbo.Pencil(TagId, isSharp, Color)
values(1, 1, 'Yellow');
-----------------------------------------------------------
-- try to make it a Computer too (fails FK)
-----------------------------------------------------------
insert into dbo.Computer(TagId, CPUType, CPUSpeed)
values(1, 'Intel', '2.66ghz')
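The same idea can be tried out without SQL Server. The sketch below uses SQLite through Python's sqlite3 module and swaps the persisted computed column for a plain column pinned by DEFAULT and CHECK, which is an assumption of this example and not part of the original answer. The helper insert_computer is also hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE Tag (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL,           -- 1 = Computer, 2 = Pencil
    UNIQUE (TagId, TagTypeId)             -- target of the composite foreign keys
);
CREATE TABLE Computer (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL DEFAULT 1 CHECK (TagTypeId = 1),
    CPUType   TEXT,
    FOREIGN KEY (TagId, TagTypeId) REFERENCES Tag(TagId, TagTypeId)
);
CREATE TABLE Pencil (
    TagId     INTEGER PRIMARY KEY,
    TagTypeId INTEGER NOT NULL DEFAULT 2 CHECK (TagTypeId = 2),
    Color     TEXT,
    FOREIGN KEY (TagId, TagTypeId) REFERENCES Tag(TagId, TagTypeId)
);
-- Tag 1 is registered as a Pencil
INSERT INTO Tag (TagId, TagTypeId) VALUES (1, 2);
INSERT INTO Pencil (TagId, Color) VALUES (1, 'Yellow');
""")

def insert_computer(conn, tag_id):
    """Try to register tag_id as a Computer; return True if the insert is accepted."""
    try:
        conn.execute("INSERT INTO Computer (TagId, CPUType) VALUES (?, 'Intel')", (tag_id,))
        return True
    except sqlite3.IntegrityError:
        return False

computer_accepted = insert_computer(conn, 1)
```

The Computer row would carry (TagId, TagTypeId) = (1, 1), but Tag only contains (1, 2), so the composite foreign key rejects it.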

Have a Tag table with TagID as an identity PK.
This will ensure that each TagID only shows up once no matter what...
Then in the Tag table have a TagType column that can either be free-form (TableName) or, better yet, have a TagType table with entries A, B, C and an FK in Tag pointing to TagType.
I would move the Tag attributes into tables A, B, C to minimize extra data in Tag, or have a series of junction tables between Tag and A, B, and C.
EDIT:
Assuming the TagID is created when the object is created, this will work fine (insert into Tag first to get the TagID, and capture it using SCOPE_IDENTITY()).
This assumes users cannot edit the TagID itself.
If users can choose the TagID then still use a Tag Table with the TagID but have another field called DisplayID where the user can type in a number. Just put on a unique constraint on Tag.DisplayID....
EDIT:
What attributes are you needing and are they nullable? If they are different for A, B, and C then it is cleaner to put them in A, B, and C especially if there might be some for A and B but not C...

Talked with Raz to clear up what he's trying to do. What he wants is a flexible way to store attributes related to tags. Tags can be one of multiple types of objects, and each object has a specific list of attributes. He also wants to be able to add objects/attributes without having to change the schema. Here's the model I came up with:

If each tag can be in a, b, or c only once, I'd just combine a, b, and c into one table. It'd be easier to give you a better idea of how to build your schema if you gave an example of exactly what you want to collect.
To me, from what I've read, it sounds like you have a list of tags and a list of objects, and you need to assign a tag to an object. If that is the case, I'd have a Tags table, an Objects table, and an ObjectTag table. In the ObjectTag table you would have a foreign key to the Tags table and a foreign key to the Objects table. Then you make a unique index on the tag foreign key, and now you've enforced your requirement of only using a tag once.
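A small sketch of that three-table layout, run against SQLite through Python's sqlite3 module. The lower-case table names, the sample objects, and the helper assign are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE tags    (tag_id    INTEGER PRIMARY KEY);
CREATE TABLE objects (object_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE object_tag (
    object_id INTEGER NOT NULL REFERENCES objects(object_id),
    tag_id    INTEGER NOT NULL REFERENCES tags(tag_id)
);
-- The unique index on the tag foreign key is what limits each tag to one object
CREATE UNIQUE INDEX object_tag_tag_uq ON object_tag(tag_id);

-- Hypothetical sample data
INSERT INTO tags VALUES (123);
INSERT INTO objects VALUES (1, 'infusion pump'), (2, 'wheelchair');
INSERT INTO object_tag VALUES (1, 123);
""")

def assign(conn, object_id, tag_id):
    """Attach tag_id to object_id; return False if the database rejects it."""
    try:
        conn.execute("INSERT INTO object_tag VALUES (?, ?)", (object_id, tag_id))
        return True
    except sqlite3.IntegrityError:
        return False

reused = assign(conn, 2, 123)  # tag 123 is already attached to object 1
```

As written, one object could still carry several tags; adding a second unique index on object_id would make the relationship strictly one-to-one.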

I would tackle this using your original structures. Relational databases are a lot better at aggregating/combining atomic data than they are at parsing complex data structures.
Keep the design of each "tag-able" object type in its own table. Data types, check constraints, default values, etc. are still easily implemented this way. Also, continue to define a FK from each object table to the Tags table.
I'm assuming you already have this in place, but if you place a unique constraint on the TagId column in each of the object tables (A, B, C, etc.) then you can guarantee uniqueness within that object type.
There are no built-in SQL Server constraints to guarantee uniqueness among all the object types, if implemented as separate tables. So, you will have to make your own validation. An INSTEAD OF trigger on your object tables can do this cleanly.
First, create a view to access the TagId list across all your object tables.
CREATE VIEW TagsInUse AS
SELECT A.TagId FROM A
UNION
SELECT B.TagId FROM B
UNION
SELECT C.TagId FROM C
;
Then, for each of your object tables, define an INSTEAD OF trigger to test your TagId.
CREATE TRIGGER dbo.T_IO_Insert_TableA ON dbo.A
INSTEAD OF INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Join to the inserted pseudo-table so multi-row inserts are checked too.
    IF EXISTS (SELECT 1
               FROM dbo.TagsInUse AS t
               JOIN inserted AS i ON i.TagId = t.TagId)
    BEGIN
        --The tag(s) is/are already in use. Create the necessary notification(s).
        RAISERROR ('You attempted to re-use a TagId. This is not allowed.', 16, 1);
        ROLLBACK TRANSACTION;
    END
    ELSE
    BEGIN
        --The tag(s) is/are available, so proceed with the INSERT.
        INSERT INTO dbo.A (TagId, Attribute1, Attribute2, Attribute3)
        SELECT i.TagId, i.Attribute1, i.Attribute2, i.Attribute3
        FROM inserted AS i;
    END
END;
GO
Keep in mind that you can also (and probably should) encapsulate that IF EXISTS test in a T-SQL function for maintenance and performance reasons.
You can write supplementary stored procedures for doing things like finding what object type a TagId is associated with.
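The view-plus-trigger idea can be tried out without SQL Server. The sketch below uses SQLite through Python's sqlite3 module; SQLite has no INSTEAD OF triggers on tables, so a BEFORE INSERT trigger with RAISE(ABORT, ...) is swapped in. The two-table setup, the trigger names, and the helper try_insert are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (TagId INTEGER PRIMARY KEY, Attribute1 TEXT);
CREATE TABLE B (TagId INTEGER PRIMARY KEY, Attribute1 TEXT);

-- Every TagId currently in use, across all object tables
CREATE VIEW TagsInUse AS
    SELECT TagId FROM A
    UNION
    SELECT TagId FROM B;

-- Abort the insert when the new TagId already appears in any object table
CREATE TRIGGER A_no_reuse BEFORE INSERT ON A
BEGIN
    SELECT RAISE(ABORT, 'TagId already in use')
    WHERE EXISTS (SELECT 1 FROM TagsInUse WHERE TagId = NEW.TagId);
END;

CREATE TRIGGER B_no_reuse BEFORE INSERT ON B
BEGIN
    SELECT RAISE(ABORT, 'TagId already in use')
    WHERE EXISTS (SELECT 1 FROM TagsInUse WHERE TagId = NEW.TagId);
END;

INSERT INTO A VALUES (123, 'patient');
""")

def try_insert(conn, table, tag_id):
    """Return True if the insert succeeds, False if a trigger or constraint rejects it."""
    assert table in ("A", "B")  # table names cannot be bound as parameters
    try:
        conn.execute(f"INSERT INTO {table} (TagId) VALUES (?)", (tag_id,))
        return True
    except sqlite3.DatabaseError:
        return False

reused = try_insert(conn, "B", 123)  # 123 is already used by table A
fresh = try_insert(conn, "B", 456)   # 456 is unused everywhere
```

Each new object table needs its own trigger and a new branch in the view, which is the maintenance cost this design carries.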
Pros
You are still taking advantage of SQL Server's data integrity features, which are all quite fast and self-documenting. Don't underestimate the usefulness of data types.
The view is an encapsulation of the domain that must be unique without combining the underlying sets of attributes. Now, you won't have to write any messy code to decipher the object's type. You can base that determination by which table contains the matching tag.
Your options remain open...
Because you didn't store everything in an EAV-friendly nvarchar(300) column, you can tweak the data types for whatever makes the most sense for each attribute.
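If you run into any performance issues, index the TagId column in each object table. (SQL Server cannot index a view that contains UNION, so the view itself cannot be indexed as written.)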
You (or your DBA) can move the object tables to different file groups on different disks if you need to balance things out and help with parallel disk I/O. Think of it as a form of horizontal partitioning. For example, if you have 8 times as many RFID tags applied to medicine containers as you have for patients, you can place the medicine table on a different disk without having to create the partitioning function that you would need for a monolithic table (one table for all types).
If you need to eventually partition your tables vertically (for archiving data onto a read-only partition), you can more easily create a partitioning function for each object type. This would be useful where the business rules do
Most importantly, implementing different business rules based on object type is much simpler. You don't have to implement any nasty conditional logic like "IF type = 'needle' THEN ... ELSE IF type = 'patient' THEN ... ELSE IF....". If you need to apply different rules, then apply them to the relevant object table without having to test a "type" value.
Cons
Triggers have to be maintained. However, this would have to be done in your application anyway, so you are performing the same data integrity checking at the database. That means that you will have no extra network overhead and this will be available for any application that uses this database.

What you're describing is a classical "table-per-type" ORM mapping. Entity Framework has built-in support of this, which you should look into.
Otherwise, I don't think most databases have easy integrity constraints that are enforced over primary keys of multiple tables.
However, is there any reason why you can't just use a single tags table to hold all the fields? Use a type field to hold the type of object. NULL all the irrelevant fields -- this way they don't consume disk space. You'll end up with far fewer tables (only one) that you can maintain as one single coherent object; it also makes you write far fewer SQL queries to work on tags that may span multiple object types.
Implementing it as one single table also saves you disk space because you can implement tiers of inheritance -- for example, "patient" and "doctor" and "nurse" can be three different object types, each having similar fields (e.g. firstname, lastname etc.) and some unique fields. Right now you'll need three tables, with duplicated fields.
It is also simpler when you add an object type. Before, you needed to add a new table and duplicate some SQL statements that span multiple object types. Now you only need to add new fields to the same table (and maybe reuse some), so far less SQL has to change.
The only reason not to go with one single table is when the number of fields makes a row too large to fit inside a SQL Server page (which I believe is 8K). Then SQL Server will complain and won't allow you to add any more fields. The solution, in this case, is to adopt an ORM tool (like Entity Framework), and then "reuse" fields. For example, if "Field1" is only used by object type #1, there is no reason why object type #3 can't use it to store something as well. You only need to be able to distinguish it in your programs.

You could have the Tags table such that it can have a pointer to any of those tables, and could include a Type that tells you which of the tables it is
Tags
-
ID
Type (A,B, or C)
A (nullable)
B (nullable)
C (nullable)
A
-
ID
(other attributes)

Polymorphic ORM database pattern

I remember when - a long time ago - I was messing around with the Java ActiveObjects ORM, I came across a database pattern it claimed to support.
However, it is very difficult to find the pattern's name by searching for the general idea, so I would really appreciate it if someone could give me the name of this pattern and some thoughts on the "cleanness" of using it.
The pattern was defined as such:
Table:
reference_type <enum>
reference <integer>
...
... where the value of the field reference_type would determine the type (and thus the table) to which was being referred. Thus:
User:
location_type <l&l, address, city, country>
location <integer>
...
... where depending on the value of the location_type field, the foreign key location would refer to either the l&l, address, city or country table.

You're having difficulty finding it because it's not a real (in the sense of widely adopted and encouraged) database design pattern.
Stay away from patterns like this. While ORM's make mapping database tables to types easier, tables are not types, and vice versa. While it's not clear what the model you've described is supposed to do, you should not have columns that serve as fake foreign keys to multiple tables (when I say "fake", I mean that you're storing a simple identifier value that corresponds to the primary key of another table, but you can't actually define the column as a foreign key).
Model your database to represent the data, model your objects to represent the process, and use your ORM and intermediate layers to do the translation; don't try to push the database into your code, and don't push your code into the database.
Edit in response to comment
You're mixing database and OO terminology; while I'm not familiar with the syntax you're using to define that function, I'm assuming it's an instance function on the User type called getLocation that takes no parameters and returns a Location object. Databases don't support the concepts of instance (or any type-based) functions; relational databases can have user-defined functions, but these are simple procedural functions that take parameters and return either values or result sets. They do not correspond to particular tables or fields in any way, other than the fact that you can use them within the body of the function.
That being said, there are two questions to answer here: how to do what you've asked, and what might be a better solution.
For what you've asked, it sounds like you have a supertype-subtype relationship, which is a standard database design pattern. In this case, you have a single supertype table that represents the parent:
Location
---------------
LocationID (PK)
...other common attributes
(Note here that I'm using LocationID for the sake of simplicity; you should have more specific and logical attributes to define the primary key, if possible)
Then you have one or more tables that define subtypes:
Address
-----------
LocationID (PK, FK to Location)
...address-specific attributes
Country
-----------
LocationID (PK, FK to Location)
...country-specific attributes
If a specific instance of Location can only be one of the subtypes, then you should add a discriminator value to the parent table (Location) that indicates which of the subtypes it corresponds to. You can use CHECK constraints to ensure that only valid values are in this field for a given row.
In the end, though, it sounds like you might be better served with a hybrid approach. You're fundamentally representing two different types of locations, from what I can see:
Coordinate-based locations (L&L)
Municipal/Postal/Etc.-based locations (Country, City, Address), and each of these is simply a more specific version of the previous
Given this, a simple model would look like this:
Location
------------
LocationID (PK)
LocationType (non-nullable) ('C' for coordinate, 'P' for postal)
LocationCoordinate
------------------
LocationID (PK; FK to Location)
Latitude (non-nullable)
Longitude (non-nullable)
LocationPostal
------------------
LocationID (PK, FK to Location)
Country (non-nullable)
City (nullable)
Address (nullable)
Now the only problem that remains is that we have nullable columns. If you want to keep your queries simple but take (justified!) flak from people about leaving nullable columns, then you can leave it as-is. If you want to go to what most people would consider a better-designed database, you can move to 6NF for our two nullable columns. Doing this will also have the nice side-effect of giving us a little more control over how these fields are populated without having to do anything extra.
Our two nullable fields are City and Address. I am going to assume that having an Address without a City would be nonsense. In this case, we remove these two attributes from the LocationPostal table and create two more tables:
LocationPostalCity
------------------
LocationID (PK; FK to LocationPostal)
City (non-nullable)
LocationPostalCityAddress
-------------------------
LocationID (PK; FK to LocationPostalCity)
Address (non-nullable)

Seems to me that city and country would be part of the address table, and that L&L wouldn't be mutually exclusive with address (you might have both...), so, why limit yourself like that to one or the other?
Furthermore, this would prevent the location column from enforcing referential integrity, would it not, since it wouldn't always reference the same table?

One big table or separate tables to store product reviews of part types?

I need to make 100 or so tables. I have tables called PartStatsXXX and the tables to be made will all be called PartReviewXXX (they pair up with each other in a 1:n relationship).
Is it efficient to create one big table to store all product (product and part being the same term from a business perspective) reviews? Someone mentioned making a relationship from PartStatsXXX to PartsReview (one large table) with the value of XXX as part of the primary key from PartStatsXXX.
XXX is the name of the part type (eg battery, wiring loom, etc). So this will be varchar. Should I make a composite key? The part type wouldn't change names (though some part names can have multiple names depending on culture), but it's not really a candidate ID. It was then mentioned I could get several views for what I need depending on the value of XXX.
I hope this makes sense. What would be the best approach?
Thanks

Multi-table PartStatsXXX is a bad idea: hard to code properly or with a framework, harder to maintain, nightmare to query...
Use two tables: PartStats and PartsReview, with appropriate keys and indexes for performance.

It is more efficient to create tables based on what you want to store in each one. You do not need 100 tables for 100 products. You need 1 table for all products.
So for your needs I would create 2 tables:
products
========
id INT
name VARCHAR
product_reviews
===============
id INT
product_id INT (foreign key to products.id)
rating INT (example column)

Unless you are storing different types of data for each product's reviews (i.e., each table has a different set of columns), using a different table per product will be creating an unnecessary nightmare.
As a general rule, you never want to have more than one table with the same set of columns. As already suggested, one table with a "product_id" column is the way to go.

If you want to save yourself some pain in a quick-and-dirty way, use two tables.
CREATE TABLE PartStats (
...,
PartType VARCHAR(255),
...
);
CREATE TABLE PartReview (
...
PartType VARCHAR(255),
...
);
and then join them up via
SELECT ...
FROM PartStats ps JOIN PartReview pr
ON ps.PartType = pr.PartType;
This gets you out from having hundreds of tables, but sets you up for a different problem: Redundant data (PartType) that can get out of sync. A typo in a PartType can yield an orphaned review.
The solution here, assuming that you can have more than one PartStats entry for a given PartType, is to add a third table to be the sole holder of PartType names.
CREATE TABLE PartType (
ID INT ...,
PartType VARCHAR(255),
PRIMARY KEY (ID)
);
and arrange for PartStats and PartReview to use the ID of a PartType. For example,
CREATE TABLE PartStats (
...,
PartType_ID INT REFERENCES PartType(ID),
...
);
CREATE TABLE PartReview (
...
PartType_ID INT REFERENCES PartType(ID),
...
);
This will prevent your making a PartStats or a PartReview for a non-existent PartType.
If query performance becomes an issue, adding secondary indexes on PartType_ID will help.
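Putting the three tables together, here is a minimal sketch run against SQLite through Python's sqlite3 module. The ID and Rating columns, the index name, the sample rows, and the helper add_review are assumptions that fill in the elided parts of the definitions above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
-- Sole holder of part type names
CREATE TABLE PartType (
    ID       INTEGER PRIMARY KEY,
    PartType TEXT NOT NULL UNIQUE
);
CREATE TABLE PartStats (
    ID          INTEGER PRIMARY KEY,
    PartType_ID INTEGER NOT NULL REFERENCES PartType(ID)
);
CREATE TABLE PartReview (
    ID          INTEGER PRIMARY KEY,
    PartType_ID INTEGER NOT NULL REFERENCES PartType(ID),
    Rating      INTEGER                    -- hypothetical example column
);
-- Secondary index to speed up lookups of reviews by part type
CREATE INDEX PartReview_PartType_idx ON PartReview(PartType_ID);

INSERT INTO PartType VALUES (1, 'battery');
INSERT INTO PartStats VALUES (1, 1);
INSERT INTO PartReview VALUES (1, 1, 4), (2, 1, 5);
""")

def add_review(conn, part_type_id, rating):
    """Insert a review; return False when the part type does not exist."""
    try:
        conn.execute(
            "INSERT INTO PartReview (PartType_ID, Rating) VALUES (?, ?)",
            (part_type_id, rating),
        )
        return True
    except sqlite3.IntegrityError:
        return False

orphan_accepted = add_review(conn, 99, 3)  # no PartType row has ID 99
battery_reviews = conn.execute("""
    SELECT COUNT(*)
    FROM PartReview AS pr JOIN PartType AS pt ON pt.ID = pr.PartType_ID
    WHERE pt.PartType = ?
""", ("battery",)).fetchone()[0]
```

Because the name lives only in PartType, a typo can no longer produce an orphaned review: the foreign key rejects any PartType_ID that does not exist.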

I can recommend a couple of decent books on database design (several months ago I decided to improve my database design skills, so I took a look at several different books and chose these two):
1) Pro SQL Server 2008 Relational Database Design and Implementation, by Louis Davidson
2) Relational Database Design Clearly Explained, by Jan Harrington
Good luck!