Handling inherited and non-inherited values in PostgreSQL

I'm trying to draw a database schema and I'm struggling with inheritance. Let me explain with an example:
I have tables List, Item and Rule:
Table List {
  id varchar [not null, pk]
  parent_list_id varchar [ref: - List.id]
}
Table Item {
  id varchar [not null, pk]
  list_id varchar [ref: > List.id]
}
Table Rule {
  id varchar [not null, pk]
  list_id varchar [ref: > List.id]
  item_id varchar [ref: > Item.id]
}
There can be multiple lists, and they inherit their Rules from the List above them. A List can have multiple Items and multiple Rules, and all Lists and Items that belong to a List inherit their Rules from the parent List. Sub-Lists and Items can also set their own Rules, and in the case of a List, those Rules are inherited by its sub-Lists.
An example:
List_A
    Rule_A (set exclusively to List_A)
    Rule_B (set exclusively to List_A)
    List_B
        Rule_A (inherited)
        Rule_B (inherited)
        Rule_C (set exclusively to List_B)
        Item_A
            Rule_A (inherited)
            Rule_B (inherited)
            Rule_C (inherited)
            Rule_D (set exclusively to Item_A)
Now, this shouldn't be such a big problem, but I also need to be able to set rules active or inactive. I considered adding a boolean active to my Rule model, but I think there are problems with that approach. If I want to set Rule_A and Rule_B to be inactive for Item_A (while they stay active under List_A and List_B), that is a problem, since my Rules have a list_id referring to List_A. I would be really grateful for suggestions on how to handle this kind of inheritance; I'm a bit stuck here.

What you have here is, paraphrasing your description in Entity–Relationship terms:
A single Item (or a List) can have multiple Rules relating to it.
A single Rule can have multiple Items (or Lists) relating to it.
(The inheritance aspect is out of this model, imo, and is basically a "prefill Rule–Item relation when there's a copy created", if that makes sense).
Which is a classic many-to-many case. And the most obvious way of handling that kind of relation is with junction tables. What's cool about those is that you can have additional fields in the junction table that hold extra attributes. For example an enabled boolean:
Table List {
  id varchar [not null, pk]
  parent_list_id varchar [ref: - List.id]
}
Table Item {
  id varchar [not null, pk]
  list_id varchar [ref: > List.id]
}
Table Rule {
  id varchar [not null, pk]
}
Table rule_to_list {
  list_id varchar [ref: > List.id]
  rule_id varchar [ref: > Rule.id]
  enabled boolean

  indexes {
    (list_id, rule_id) [pk] // composite PK
  }
}
Table rule_to_item {
  item_id varchar [ref: > Item.id]
  rule_id varchar [ref: > Rule.id]
  enabled boolean

  indexes {
    (item_id, rule_id) [pk] // composite PK
  }
}
With this schema you will be able to easily enable/disable individual rules with regard to a specific list/item, and even remove them completely if there's a need for that.
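For instance, to disable Rule_A for Item_A only, something like this would do (a minimal sketch in PostgreSQL; I'm assuming the tables get created with lowercase names and that ids are strings like 'Item_A'; the ON CONFLICT target relies on the composite PK above):
-- Disable one rule for one item, whether or not an override row exists yet.
insert into rule_to_item (item_id, rule_id, enabled)
values ('Item_A', 'Rule_A', false)
on conflict (item_id, rule_id) do update
set enabled = excluded.enabled;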
The only thing you will need to implement is copying parent item/list rules to the junction tables, but that's pretty easy. Or you can go with an implicit "all items/lists inherit rules from their parents unless there's an override" (i.e. the parent list has rule A enabled, and the child list has A explicitly disabled in a junction table) and just traverse up the hierarchy when computing the rule list for an item/list.
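A minimal sketch of that traversal in PostgreSQL (same lowercase-naming assumption; 'List_B' is a placeholder id):
with recursive ancestry as (
    -- Start at the list in question, then walk up through its parents.
    select id, parent_list_id, 0 as depth
    from list
    where id = 'List_B'
    union all
    select l.id, l.parent_list_id, a.depth + 1
    from list l
    join ancestry a on l.id = a.parent_list_id
)
-- For each rule, keep the setting from the nearest list in the chain,
-- so a child's override beats an inherited setting.
select distinct on (rl.rule_id)
       rl.rule_id, rl.enabled
from ancestry a
join rule_to_list rl on rl.list_id = a.id
order by rl.rule_id, a.depth;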

Related

How to use foreign key in two tables based on a flag column?

I have a parent table Tree and two child tables Post and Department.
Based on the Flag column, this relation must be set.
How can I do this?
You cannot do that with foreign keys. You could implement a trigger which would check for the ReferenceID presence either in the Post or in the Department table, based on the Flag column.
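For illustration, a rough sketch of such a trigger in PostgreSQL (assuming the original design Tree(ID, ReferenceID, Flag), with the hypothetical encoding Flag = 1 for Post and Flag = 2 for Department):
CREATE OR REPLACE FUNCTION check_tree_reference()
RETURNS trigger AS
$$
BEGIN
    -- Validate ReferenceID against the table selected by Flag.
    IF NEW.Flag = 1 AND NOT EXISTS (SELECT 1 FROM Post WHERE ID = NEW.ReferenceID) THEN
        RAISE EXCEPTION 'ReferenceID % not found in Post', NEW.ReferenceID;
    ELSIF NEW.Flag = 2 AND NOT EXISTS (SELECT 1 FROM Department WHERE ID = NEW.ReferenceID) THEN
        RAISE EXCEPTION 'ReferenceID % not found in Department', NEW.ReferenceID;
    END IF;
    RETURN NEW;
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER tree_reference_check
BEFORE INSERT OR UPDATE ON Tree
FOR EACH ROW EXECUTE PROCEDURE check_tree_reference();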
Although the best approach would be to change your design to have two nullable columns as follows, ensuring that exactly one of them has a value:
CREATE TABLE Tree (
    ID Integer NOT NULL PRIMARY KEY,
    PostID Integer REFERENCES Post(ID),
    DepartmentID Integer REFERENCES Department(ID),
    Flag INTEGER NOT NULL,
    -- Exactly one of the two references must be set.
    CHECK ((PostID IS NOT NULL AND DepartmentID IS NULL)
        OR (PostID IS NULL AND DepartmentID IS NOT NULL))
)

Feedback on database table structure for queue containing arbitrary tasks

I want to introduce queue functionality in an existing application built on Access VBA with a SQL Server 2012 backend. It should allow the application to store open tasks with 1:n parameters in a queue table and process them later. It deserves mentioning that for some tasks it might take several process steps until all the information needed for their processing is available.
Some more information on my current situation:
The data needs to be persisted in the database for compliance reasons
No more than 1500 tasks will be processed each day
The application will be rebuilt (except for the backend); the new application will make much heavier use of this queue functionality
The total number of different tasks to be queued, as well as the number of parameters they might need, is unknown
My current best approach - albeit an EAV schema - would consist of three tables:
1. Table "tblQueueItemType"
It contains definitions for each type (or category) of task.
It contains an id, a name and an attribute count. This attribute count defines the number of attributes for this task. I want to use it later on to ensure data consistency for all tasks with status "READY".
Example for an entry in this table:
"1", "Generate Book Database Entry", "5"
2. Table "tblQueueItemHeader"
It represents the instantiated tasks defined in tblQueueItemType. Each has a task id, the corresponding task type defined in tblQueueItemType, a status, and a timestamp.
The status is either OPEN (not all information available), READY (all information available to process the task), or DONE (processed).
Example for an entry in this table:
"2", "1", "OPEN"
3. Table "tblQueueItemAttribute"
It contains all the information the tasks need to be processed. It contains an id, the id of the header, an attribute type and an attribute value.
Example entries for this table:
"1","2", "Author", "H.G. Wells"
"1","2", "No. Pages", "1234"
My table definitions so far:
CREATE TABLE [dbo].[tblQueueItemType](
    id INT NOT NULL IDENTITY (1,1) PRIMARY KEY,
    Name NVARCHAR(50) NOT NULL,       -- room for names like the example above
    AttributeCount INT NOT NULL
)
CREATE TABLE [dbo].[tblQueueItemHeader](
    id INT NOT NULL IDENTITY (1,1) PRIMARY KEY,
    QueueItemTypeId INT NOT NULL,
    Status NVARCHAR(5) NOT NULL,
    Timestamp DATETIME NOT NULL,
    CONSTRAINT QueueTypeHeader
        FOREIGN KEY (QueueItemTypeId)
        REFERENCES tblQueueItemType (id)
)
CREATE TABLE [dbo].[tblQueueItemAttribute](
    id INT NOT NULL IDENTITY (1,1) PRIMARY KEY,
    QueueItemHeaderId INT NOT NULL,
    Attribute NVARCHAR(50) NOT NULL,  -- room for names like "No. Pages"
    Value NVARCHAR(50) NOT NULL,
    Timestamp DATETIME NOT NULL,
    CONSTRAINT QueueHeaderAttribute
        FOREIGN KEY (QueueItemHeaderId)
        REFERENCES tblQueueItemHeader (id)
)
ALTER TABLE tblQueueItemHeader
ADD CONSTRAINT QueueItemHeaderStatus
CHECK (Status IN ('OPEN', 'READY', 'DONE'));
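For illustration, the consistency check I have in mind for READY tasks could be something like this (just a sketch):
-- Tasks marked READY whose stored attribute count does not match
-- the AttributeCount defined for their type.
SELECT h.id
FROM tblQueueItemHeader h
JOIN tblQueueItemType t ON t.id = h.QueueItemTypeId
LEFT JOIN tblQueueItemAttribute a ON a.QueueItemHeaderId = h.id
WHERE h.Status = 'READY'
GROUP BY h.id, t.AttributeCount
HAVING COUNT(a.id) <> t.AttributeCount;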
Obviously the current design is suboptimal. What would be the best schema for this kind of use case? How feasible is my current approach?
Thank you very much!

How do I model subtyping in a relational schema?

Is the following DB-schema ok?
REQUEST-TABLE
REQUEST-ID | TYPE | META-1 | META-2 |
This table stores all the requests, each of which has a unique REQUEST-ID. The TYPE is either A, B or C. This will tell us which table contains the specific request parameters. Other than that, we have the tables for the respective types. These tables store the parameters for the respective requests. META-1 and META-2 are just some additional info, like timestamps and stuff.
TYPE-A-TABLE
REQUEST-ID | PARAM_X | PARAM_Y | PARAM_Z
TYPE-B-TABLE
REQUEST-ID | PARAM_I | PARAM_J
TYPE-C-TABLE
REQUEST-ID | PARAM_L | PARAM_M | PARAM_N | PARAM_O | PARAM_P | PARAM_Q
The REQUEST-ID is the foreign key into the REQUEST-TABLE.
Is this design normal/best-practice? Or is there a better/smarter way? What are the alternatives?
It somehow feels strange to me, having to do a query on the REQUEST-TABLE to find out which TYPE-TABLE contains the information I need, only to then do the actual query I'm interested in.
For instance, imagine a method which, given an ID, should retrieve the parameters. This method would need to do two database accesses:
- Find correct table to query
- Query table to get the parameters
Note: In reality we have something like 10 types of requests, i.e. 10 TYPE tables. Moreover, there are many entries in each of the tables.
Meta-Note: I find it hard to come up with a proper title for this question (one that is not overly broad). Please feel free to make suggestions or edit the title.
For exclusive types, you just need to make sure rows in one type table can't reference rows in any other type table.
create table requests (
    request_id integer primary key,
    request_type char(1) not null
        -- You could also use a table to constrain valid types.
        check (request_type in ('A', 'B', 'C', 'D')),
    meta_1 char(1) not null,
    meta_2 char(1) not null,
    -- Foreign key constraints don't reference request_id alone. If they
    -- did, they might reference the wrong type.
    unique (request_id, request_type)
);
You need that apparently redundant unique constraint so the pair of columns can be the target of a foreign key constraint.
create table type_a (
    request_id integer not null,
    request_type char(1) not null default 'A'
        check (request_type = 'A'),
    primary key (request_id),
    foreign key (request_id, request_type)
        references requests (request_id, request_type) on delete cascade,
    param_x char(1) not null,
    param_y char(1) not null,
    param_z char(1) not null
);
The check() constraint guarantees that only 'A' can be stored in the request_type column. The foreign key constraint guarantees that each row will reference an 'A' row in the table "requests". Other type tables are similar.
create table type_b (
    request_id integer not null,
    request_type char(1) not null default 'B'
        check (request_type = 'B'),
    primary key (request_id),
    foreign key (request_id, request_type)
        references requests (request_id, request_type) on delete cascade,
    param_i char(1) not null,
    param_j char(1) not null
);
Repeat for each type table.
I usually create one updatable view for each type. The views join the table "requests" with one type table. Application code uses the views instead of the base tables. When I do that, it usually makes sense to revoke privileges on the base tables. (Not shown.)
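For illustration, a sketch of such a view for type A (the name type_a_only matches the example below; note that in PostgreSQL a join view like this is not automatically updatable, so you would add INSTEAD OF triggers or rules to route writes to the base tables):
-- One view per type, joining the supertype and subtype tables.
create view type_a_only as
select r.request_id, r.meta_1, r.meta_2,
       a.param_x, a.param_y, a.param_z
from requests r
join type_a a on a.request_id = r.request_id;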
If you don't know which type something is, then there's no alternative to running one query to get the type, and another query to select or update.
select request_type from requests where request_id = 42;
-- Say it returns 'A'. I'd use the view type_a_only.
update type_a_only
set param_x = '!' where request_id = 42;
In my own work, it's pretty rare to not know the type, but it does happen sometimes.
The phrase you may be looking for is "how do I model inheritance in a relational schema". It's been asked before. Whilst this is a reference to object-oriented software design, the basic question is the same: how do I deal with data where there is an "x is a type of y" relationship?
In your case, "request" is the abstract class, and TypeA, TypeB, etc. are the subclasses.
Your solution is one of the classic answers: "table per subclass". It's clean and easy to maintain, but it does mean you may need multiple database accesses to retrieve the data.

Cascade UPDATE to related objects

I've set up my database and application to soft delete rows. Every table has an is_active column whose value should be either TRUE or NULL. The problem I have right now is that my data is out of sync, because, unlike a DELETE statement, setting a value to NULL doesn't cascade to rows in separate tables that reference the "deleted" row as a foreign key.
I have already taken measures to correct the data by finding inactive rows from the source table and manually setting related rows in other tables to be inactive as well. I recognize that I could do this at the application level (I'm using Django/Python for this project), but I feel like this should be a database process. Is there a way to utilize something like PostgreSQL's ON UPDATE constraint so that when a row has is_active set to NULL, all rows in separate tables referencing the updated row as a foreign key automatically have is_active set to NULL as well?
Here's an example:
An assessment has many submissions. If the assessment is marked inactive, all submissions related to it should also be marked inactive.
To my mind, it doesn't make sense to use NULL to represent a Boolean value. The semantics of "is_active" suggest that the only sensible values are True and False. Also, NULL interferes with cascading updates.
So I'm not using NULL.
First, create the "parent" table with both a primary key and a unique constraint on the primary key and "is_active".
create table parent (
    p_id integer primary key,
    other_columns char(1) default 'x',
    is_active boolean not null default true,
    unique (p_id, is_active)
);
insert into parent (p_id) values
(1), (2), (3);
Create the child table with an "is_active" column. Declare a foreign key constraint referencing the columns in the parent table's unique constraint (last line in the CREATE TABLE statement above), and cascade updates.
create table child (
    p_id integer not null,
    is_active boolean not null default true,
    foreign key (p_id, is_active) references parent (p_id, is_active)
        on update cascade,
    some_other_key_col char(1) not null default '!',
    primary key (p_id, some_other_key_col)
);
insert into child (p_id, some_other_key_col) values
(1, 'a'), (1, 'b'), (2, 'a'), (2, 'c'), (2, 'd'), (3, '!');
Now you can set the "parent" to false, and that will cascade to all referencing tables.
update parent
set is_active = false
where p_id = 1;
select *
from child
order by p_id;
p_id | is_active | some_other_key_col
-----+-----------+-------------------
   1 | f         | a
   1 | f         | b
   2 | t         | a
   2 | t         | c
   2 | t         | d
   3 | t         | !
Soft deletes are a lot simpler and have much better semantics if you implement them as valid-time state tables. FWIW, I think the terms soft delete, undelete, and undo are all misleading in this context, and I think you should avoid them.
PostgreSQL's range data types are particularly useful for this kind of work. I'm using date ranges, but timestamp ranges work the same way.
For this example, I'm treating only "parent" as a valid-time state table. That means that invalidating a particular row (soft deleting a particular row) also invalidates all the rows that reference it through foreign keys. It doesn't matter whether they reference it directly or indirectly.
I'm not implementing soft deletes on "child". I can do that, but I think that would make the essential technique unreasonably hard to understand.
create extension btree_gist; -- Necessary for the kind of exclusion
                             -- constraint below.
create table parent (
    p_id integer not null,
    other_columns char(1) not null default 'x',
    valid_from_to daterange not null,
    primary key (p_id, valid_from_to),
    -- No overlapping date ranges for a given value of p_id.
    exclude using gist (p_id with =, valid_from_to with &&)
);
create table child (
    p_id integer not null,
    valid_from_to daterange not null,
    foreign key (p_id, valid_from_to) references parent on update cascade,
    other_key_columns char(1) not null default 'x',
    primary key (p_id, valid_from_to, other_key_columns),
    other_columns char(1) not null default 'x'
);
Insert some sample data. In PostgreSQL, the daterange data type has a special value 'infinity'. In this context, it means that the row that has the value 1 for "parent"."p_id" is valid from '2015-01-01' until forever.
insert into parent values
(1, 'x', daterange('2015-01-01', 'infinity'));
insert into child values
(1, daterange('2015-01-01', 'infinity'), 'a', 'x'),
(1, daterange('2015-01-01', 'infinity'), 'b', 'y');
This query will show you the joined rows.
select *
from parent p
left join child c
on p.p_id = c.p_id
and p.valid_from_to = c.valid_from_to;
To invalidate a row, update the date range. This row (below) was valid from '2015-01-01' to '2015-01-31'. That is, it was soft deleted on 2015-01-31.
update parent
set valid_from_to = daterange('2015-01-01', '2015-01-31')
where p_id = 1 and valid_from_to = daterange('2015-01-01', 'infinity');
Insert a new valid row for p_id 1, and pick up the child rows that were invalidated on Jan 31.
insert into parent values (1, 'r', daterange(current_date, 'infinity'));
update child set valid_from_to = daterange(current_date, 'infinity')
where p_id = 1 and valid_from_to = daterange('2015-01-01', '2015-01-31');
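With this arrangement, "not soft deleted" simply means "valid now", so reads filter on the range, e.g. (a sketch):
-- Rows that are currently valid: the date range contains today.
select *
from parent
where valid_from_to @> current_date;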
Richard T Snodgrass's seminal book Developing Time-Oriented Database Applications in SQL is available free from his university web page.
You can use a trigger:
CREATE OR REPLACE FUNCTION trg_upaft_upd_trip()
  RETURNS TRIGGER AS
$func$
BEGIN
    UPDATE submission s
    SET    is_active = NULL
    WHERE  s.assessment_id = NEW.assessment_id
    AND    NEW.is_active IS NULL;  -- recheck to be sure
    RETURN NEW;                    -- call this BEFORE UPDATE
END
$func$ LANGUAGE plpgsql;

CREATE TRIGGER upaft_upd_trip
BEFORE UPDATE ON assessment
FOR EACH ROW
WHEN (OLD.is_active AND NEW.is_active IS NULL)
EXECUTE PROCEDURE trg_upaft_upd_trip();
Related:
How do I make a trigger to update a column in another table?
Be aware that a trigger has more possible points of failure than an FK constraint with ON UPDATE CASCADE / ON DELETE CASCADE.
@Mike added a solution with a multi-column FK constraint that I would consider as an alternative.
Related answer on dba.SE:
Enforcing constraints “two tables away”
Related answer one week later:
Cross table constraints in PostgreSQL
This is more a schema problem than a procedural one.
You may have dodged creating a solid definition of "what constitutes a record". At the moment you have object A that may be referenced by object B, and when A is "deleted" (has its is_active column set to FALSE, or NULL, in your current case), B does not reflect that. It sounds like this is a single table (you only mention rows, not separate classes or tables...) and you have a hierarchical model formed by self-reference. If that is the case, you can think of the problem in a few ways:
Recursive lineage
In this model you have one table that contains all the data in one place, whether it's a parent, a child, etc., and you check the table for recursive references to traverse the tree.
It is tricky to do this properly in an ORM that lacks explicit support for this without accidentally writing routines that either:
iteratively pound the crap out of your DB by making at least one query per node or
pulling the entire table at once and traversing it in application code
It is, however, straightforward to do this in Postgres and let Django access it via a model over an unmanaged view on the lineage query you build. (I wrote a little about this once.) Under this model your query will descend the tree until it hits the first row of the current branch that is marked as not active and stop, effectively truncating all the rows below that one (no need to propagate the is_active column!).
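A minimal sketch of such a lineage query (assuming a hypothetical self-referencing table node(id, parent_id, is_active) where root rows are their own parent, as in the blog example below):
with recursive lineage as (
    select id, parent_id
    from node
    where id = parent_id          -- roots are their own parent
      and is_active
    union all
    select n.id, n.parent_id
    from node n
    join lineage l on n.parent_id = l.id
    where n.id <> n.parent_id     -- don't revisit the roots
      and n.is_active             -- stop descending at inactive rows
)
select * from lineage;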
If this were, say, a blog entry + comments within the same structure (a fairly common CMS schema) then any row that is its own parent is a primary entity and anything that has a parent that is not itself is a comment. To remove a whole blog post + its children you mark just the blog post's row as inactive; to remove a thread within the comments mark as inactive the comment that begins that thread.
For a blog + comments type feature this is usually the most straightforward way to do things -- though most CMS systems get it wrong (but usually only in ways that matter if you start doing serious data work later; if you're just setting up some place for people to argue on the internet, then Worse is Better).
Recursive lineage + External "record" definition
In this model you have your tree of nodes and your primary entities separated. The primary entities are marked as being active or not, and that attribute is common to all the elements that are related to it within the context of that primary entity (they exist and have a meaning independent of it). This means two tables, one for primary entities, and one for your tree of nodes.
Use this when you have something more interesting going on than simply threaded discussion. For example, a model of components where a tree of things may be aggregated separately into other larger things, and you need to have a way to mark those "other larger things" as active or not independently of the components themselves.
Further down the rabbit hole...
There are other takes on this idea, but they get increasingly non-trivial, which is probably not suitable. For example, consider a third basic take on this model where the hierarchy structure, the node bodies, and the primary entities are all separated into different tables. One node body might appear in multiple trees by reference, and multiple trees may be considered active or inactive in the context of a single primary entity, etc.
Consider heading this direction if your data is more complex. If you wind up really needing models this far decomposed ("normalized"), then I would caution that any ORM is probably going to wind up being a lot more trouble than it's worth -- you will start running headlong into the problem that ORMs are fundamentally leaky abstractions (one object can never really equate to one table...).

Database best practices

I have a table which stores comments. A comment can come either from another user or from another profile, which are separate entities in this app.
My original thinking was that the table would have both user_id and profile_id fields, so if a user submits a comment, it fills in the user_id and leaves the profile_id blank.
Is this right or wrong, or is there a better way?
What the best solution is depends, IMHO, on more than just the table itself; it also depends on how it is used elsewhere in the application.
Assuming that the comments are all associated with some other object, let's say you extract all the comments for that object. In your proposed design, extracting all the comments requires selecting from just one table, which is efficient. But that is extracting the comments without the information about the poster of each comment. Maybe you don't want to show it, or maybe it is already cached in memory.
But what if you had to retrieve information about the poster while retrieving the comments? Then you have to join with two different tables, and the resulting record set gets polluted with a lot of NULL values (for a profile comment, all the user fields will be NULL). The code that parses this result set could also get more complex.
Personally, I would probably start with the fully normalized version, and then denormalize when I start seeing performance problems.
There is also a completely different possible solution to the problem, but it depends on whether or not it makes sense in the domain. What if there are other places in the application where a user and a profile can be used interchangeably? What if a User is just a special kind of Profile? Then I think the solution should be handled generally in the user/profile tables. For example (some abbreviated pseudo-SQL):
create table AbstractProfile (ID primary key, type) -- type can be 'user' or 'profile'
create table User (ProfileID primary key references AbstractProfile, ...)
create table Profile (ProfileID primary key references AbstractProfile, ...)
Then, any place in your application where a user or a profile can be used interchangeably, you can reference the ProfileID.
If the comments are general for several kinds of objects, you could create a link table per object type:
user_comments (user_id, comment_id)
profile_comments (profile_id, comment_id)
Then you do not have to have any empty columns in your comments table. It will also make it easy to add new comment-source-objects in the future without touching the comments table.
Another way to solve it is to always denormalize (copy) the name of the commenter onto the comment and also store a reference back to the commenter via a type and an id field. That way you have a unified comments table on which you can search, sort and trim quickly. The drawback is that there isn't any real FK relationship between a comment and its owner.
In the past I have used a centralized comments table with a field for the fk_table it references, e.g.:
comments(id, fk_id, fk_table, comment_text)
That way you can use UNION queries to concatenate the data from several sources.
SELECT c.comment_text FROM comments c JOIN user u ON u.id = c.fk_id WHERE c.fk_table = 'user'
UNION ALL
SELECT c.comment_text FROM comments c JOIN profile p ON p.id = c.fk_id WHERE c.fk_table = 'profile'
This ensures that you can expand the number of objects that have comments without creating redundant tables.
Here's another approach, which allows you to maintain referential integrity through foreign keys, manage things centrally, and get good performance using standard database tools such as indexes and, if you really need it, partitioning:
create table actor_master_table (
    type char(1) not null,     /* e.g. 'u' or 'p' for user / profile */
    id varchar(20) not null,   /* e.g. 'someuser' or 'someprofile' */
    primary key (type, id)
);
create table user (
    type char(1) not null,
    id varchar(20) not null,
    ...
    check (type = 'u'),
    foreign key (type, id) references actor_master_table (type, id)
);
create table profile (
    type char(1) not null,
    id varchar(20) not null,
    ...
    check (type = 'p'),
    foreign key (type, id) references actor_master_table (type, id)
);
create table comment (
    creator_type char(1) not null,
    creator_id varchar(20) not null,
    comment text not null,
    foreign key (creator_type, creator_id) references actor_master_table (type, id)
);
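A hypothetical usage sketch (the master row must exist before the typed row or any comment; note that user is a reserved word in many DBMSs and may need quoting):
-- Register the actor in the master table first, then in its typed table.
insert into actor_master_table (type, id) values ('u', 'someuser');
insert into "user" (type, id) values ('u', 'someuser');
-- The comment now satisfies the composite FK to the master table.
insert into comment (creator_type, creator_id, comment)
values ('u', 'someuser', 'Nice work!');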
