Modeling a data-store browser - database

I have a connection-object browser that lets a user view the objects of the various data sources they are connected to. The object viewer looks something like this:
Connection: Remote.1234.MySQL (3 level source)
    Database: Sales
        Table: User
            Field: Name -- CHAR(80)
            Field: Age -- INT32
        Table: Product
            ...
        Table: Purchase
            ...
    Database: Other
        ...
Connection: Remote.abc.ElasticSearch (2 level source)
    Index: Inventory
        Field: ID -- INTEGER
        Field: Product -- STRING
        ...
Connection: Local.xyz.MongoDB (3 level source)
    Database: Mail
        Collection: Users
            Field: MailboxID -- INTEGER
            Field: Name -- STRING
        Collection: Documents
            ...
Connection: Local.xyz.SQLServer (4 level source)
    Database: Main
        Schema: Public
            Table: user
                Field: Name -- STRING
    Database: History
        ...
In other words, a 'Source' is a hierarchy with a known number of levels and a known 'name' for each level. While the hierarchy varies from source to source, the hierarchy of any given source will always have the same number of levels and the same level names. What might be a good way to model this relationally? My thought was to have the following:
Connection:
    id
    host
    (other details)
SourceType:
    id
    Name
SourceTypeLevelMapping:
    SourceTypeID
    level (int)
    name
ThreeLevelSource_Level1: # e.g., Database
    ID
    ParentID (ConnectionID)
    Name
    (other details)
ThreeLevelSource_Level2: # e.g., Table
    ID
    ParentID (Level1ID)
    Name
    (other details)
ThreeLevelSource_Level3: # e.g., Field
    ID
    ParentID (Level2ID)
    FieldName
    FieldType
    (other details)
Then do the same for the other level-ed hierarchies:
TwoLevelSource_Level1, TwoLevelSource_Level2
FourLevelSource_Level1, FourLevelSource_Level2, FourLevelSource_Level3, FourLevelSource_Level4
So basically we define the known hierarchies, and each new source that we add is attached to one of the known hierarchy shapes. The alternative approach I was thinking of is to create a new set of tables for each new source, but then we would be looking at literally hundreds of tables if we were to allow access to 25-50 sources.
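For illustration only, a minimal sketch of the generic mapping tables described above, plus a row set describing the MySQL shape, might look like this (names and types are illustrative, not a finished design):
CREATE TABLE SourceType (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL                      -- e.g. 'MySQL', 'ElasticSearch'
);
CREATE TABLE SourceTypeLevelMapping (
    SourceTypeID INTEGER NOT NULL REFERENCES SourceType(id),
    level        INTEGER NOT NULL,          -- 1 = outermost level under the connection
    name         TEXT NOT NULL,             -- display name of that level
    PRIMARY KEY (SourceTypeID, level)
);
INSERT INTO SourceType VALUES (1, 'MySQL');
INSERT INTO SourceTypeLevelMapping VALUES (1, 1, 'Database'), (1, 2, 'Table'), (1, 3, 'Field');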
What might be a good way to model this type of hierarchical data?
(Also, yes, I am familiar with the existing general approaches for modeling hierarchical data as delineated here -- What are the options for storing hierarchical data in a relational database?, How can you represent inheritance in a database? -- this question is not a duplicate of those.)

Relational Solution
Responding to the relational-database and hierarchic-data tags, the latter being pedestrian in the former.
1.1 Preliminary
Given the requirement to cover, and the differences between:
the genuine SQL Platforms (conformance to the Standard; server architecture; unified language; etc),
the pretend "SQL" programs (no architecture; bits of language spread across those programs; no Transactions; no ACID; etc), which provide no compliance to the Standard and therefore use the term incorrectly, and
the non-SQLs,
I use Record and Field to cover all possibilities, instead of the Relational terms, which would convey Relational definitions.
All possibilities are catered for, but a Relational and SQL-compliant approach (eg. MS SQL Server) is taken as the best method, due to its 40-year establishment and maturity, and the absence of alternatives.
The collection of SQL Platforms, pretend "SQL" applications, and non-SQL suites is labelled DataSource.
1.2 Compliance
This solution is 100% Relational: Codd's Relational Model, not the substandard alternatives marketed by the academics as "relational":
It can be implemented in any SQL compliant Platform
It has Relational Integrity (which is logical, beyond Referential Integrity, which is SQL and physical); Relational Power; and Relational Speed.
All update interaction is simple, via SQL ACID Transactions.
No warranties are given for pretend "SQLs" and non-SQLs.
2 Solution
2.1 Concept
I appreciate that as a developer, your focus is on the data values and how to retrieve them. However, two levels of definition are required first, in order to support the third level, which holds the data:
Catalogue Potential
Blue (Reference cluster).
The DataSources and definitions that are available in the market, which the organisation might use. Let's say 42, as per your descriptions.
I would entrust this only to a developer, not a user_admin, because the setup is critical (the lower levels depend on it), and it describes the physical capabilities and limitations of each DataSource.
Catalogue Actual
Green (Identification cluster).
The DataSources and definitions that are actually contracted and used by the organisation. Let's say 12. At this point we have connection addresses; ports; and users. It is constrained by CataloguePotential, both directly and via CHECKs that call Functions.
This level defines the content (the tables that actually exist); it contains no data values.
Maintaining an SQL mindset is the most prudent course, given that SQL is an established Standard with 40 years of maturity, and it gives us the most flexibility: the CatalogueActual forms the SQL Catalogue.
Likewise, I have used the terms Record and Field for the objects in the collective, rather than Table and Column, which would imply Relational and SQL meanings.
SQL Platform
This level can be populated automatically by the program querying the SQL Catalogue (a sketch follows below).
"SQL" applications and non-SQL suites
Population is manual due to the absence of a Catalogue. It can be done by a user_admin. The constraint would be your program attempting a trial query to validate the user-supplied table definition.
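As a rough illustration of the automatic population mentioned for SQL Platforms, a query against the standard INFORMATION_SCHEMA views could feed the CatalogueActual. The target table and column names below are assumptions, not part of the model given in the PDFs:
-- Hypothetical staging query: harvest record/field definitions from a connected
-- SQL platform into the CatalogueActual (target names assumed).
INSERT INTO CatalogueActual_Field (RecordName, FieldName, DataType, MaxLength)
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_SCHEMA = 'Sales';        -- one schema per pass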
Current Data
Yellow (Transaction cluster)
The current data that the user has queried from the DataSources, via his Connection, for the webpage. The assumption here is that the user::webpage is central and governing (one user per Connection; one user per webpage), not the OO Object.
If the OO Objects are not reliable (depends on the library you use), or there is one set of Objects across all user-webpages, more Constraints need to be added.
2.2 Method
You need:
Simple Hierarchy
a single-parent hierarchy to replicate the fixed levels of definition in the Catalogue in the SQL servers, as well as the variable levels in the constructed catalogue for the pretend "SQLs" and the non-SQLs.
Relational Hierarchies are fully defined, along with SQL implementation details, in the Hierarchy doc. The simple or single-parent model is given in [§ 2.2].
The Root level (not the Anchor) is the Potential DataSource
The Leaf level is that which contains data, either a Record or a Struct (for those in the collective that allow one).
In the Potential DataSource, it is representative: truly a RecordType and FieldType.
In the Actual DataSource, it is an actual Record, which is an instance of RecordType, and an actual Field, which is a narrower definition of FieldType.
Method/Struct
In order to handle a Struct, which in definition terms is identical to a Record, and to allow a Struct to contain a Struct, we need a level of abstraction, which is ...
Article
is either
a Field, which is the atomic unit of storage, xor
a Struct, which contains Articles
that requires an Exclusive Subtype cluster, fully defined along with SQL implementation details, in the Subtype doc
Method/Array
To support an Array of Fields:
These are multi-valued dependencies on Field, thus implemented as child tables.
For scalars the NumElement is 1.
That makes the Exclusive Subtype cluster on Field, which would otherwise be required for scalars, redundant (a rough sketch of these structures follows below).
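The full model is given in the linked PDFs; purely as an illustration of the Article/Field/Struct subtyping and the array-as-child-table idea, a minimal sketch (all names and types here are assumptions, and the proper subtype enforcement via discriminator-carrying FKs and CHECKs calling Functions is in the Subtype doc) could look like:
-- Exclusive subtype: an Article is either a Field xor a Struct (sketch only).
CREATE TABLE Article (
    ArticleId   int     NOT NULL PRIMARY KEY,
    ArticleType char(1) NOT NULL CHECK (ArticleType IN ('F', 'S'))  -- discriminator
);
CREATE TABLE Field (
    ArticleId   int         NOT NULL PRIMARY KEY REFERENCES Article (ArticleId),
    FieldType   varchar(30) NOT NULL
);
CREATE TABLE Struct (
    ArticleId   int NOT NULL PRIMARY KEY REFERENCES Article (ArticleId)
);
-- A Struct contains Articles (which may themselves be Structs).
CREATE TABLE StructArticle (
    StructId    int NOT NULL REFERENCES Struct (ArticleId),
    ArticleId   int NOT NULL REFERENCES Article (ArticleId),
    PRIMARY KEY (StructId, ArticleId)
);
-- Array of Fields as a multi-valued dependency: one child row per element;
-- for a scalar there is exactly one row (NumElement = 1).
CREATE TABLE FieldElement (
    ArticleId   int NOT NULL REFERENCES Field (ArticleId),
    NumElement  int NOT NULL,
    PRIMARY KEY (ArticleId, NumElement)
);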
2.3 Relational Data Model
This is the progress after seven iterations.  It shows the Table-Relation level (the Attribute level is too large for an inline graphic).
Assumption
That the JS (or whatever) objects are local to the webpage/user.  If your objects are global, the value tables need to be constrained to Connection.
The data model is given in a single PDF:
Table Relation level
Table Relation level + sample data
Table Attribute level + sample data.
2.4 Notation
All my data models are rendered in IDEF1X, available from the early 1980's, the one and only notation for Relational Data Modelling, the Standard since 1993.
The IDEF1X Introduction is essential reading for those who are new to Codd's Relational Model, or its modelling method. Note that IDEF1X models are complete: they are rich in detail and precision, showing all required details, whereas a home-grown model, being unaware of the imperatives of the Standard, has far less definition. Which means the notation needs to be fully understood.

Here are three working SQLite-flavoured implementations (since SQLite is being used, unenforced column types are acceptable; integer primary keys were used so that they act as the rowid):
In all cases, SQLite's foreign key PRAGMA is enabled: PRAGMA foreign_keys = 1;
Simple implementation - one fixed table for each source/level (constrained by foreign keys)
The following design/implementation uses one table for each type of database and level. Tables reference one another with foreign keys to ensure correctness. For example, a Mongo collection can't be a child of a MySQL database. Only at the connection level do all database types share the same table, but that could be changed if different properties are expected for each kind of connection.
create table databasetype(name primary key) without rowid;
insert into databasetype values ('mysql'),('elasticsearch'),('mongo'),('sqlserver');
create table datatype(name primary key) without rowid;
insert into datatype values ('int'),('str'); -- you can differentiate varchar if you will
create table connection(id integer, hostname, databasetype, primary key(id), foreign key(databasetype) references databasetype(name));
create table mysqldatabase(id integer, connectionid, name, primary key(id), foreign key(connectionid) references connection(id));
create table mysqltable(id integer, databaseid, name, primary key(id), foreign key(databaseid) references mysqldatabase(id));
create table mysqlfield(id integer, tableid, name, datatype, datalength, primary key(id), foreign key(tableid) references mysqltable(id), foreign key(datatype) references datatype(name));
create table elasticsearchindex(id integer, connectionid, name, primary key(id), foreign key(connectionid) references connection(id));
create table elasticsearchfield(id integer, indexid, name, datatype, datalength, primary key(id), foreign key(indexid) references elasticsearchindex(id), foreign key(datatype) references datatype(name));
create table mongodatabase(id integer, connectionid, name, primary key(id), foreign key(connectionid) references connection(id));
create table mongocollection(id integer, databaseid, name, primary key(id), foreign key(databaseid) references mongodatabase(id));
create table mongofield(id integer, collectionid, name, datatype, datalength, primary key(id), foreign key(collectionid) references mongocollection(id), foreign key(datatype) references datatype(name));
create table sqlserverdatabase(id integer, connectionid, name, primary key(id), foreign key(connectionid) references connection(id));
create table sqlserverschema(id integer, databaseid, name, primary key(id), foreign key(databaseid) references sqlserverdatabase(id));
create table sqlservertable(id integer, schemaid, name, primary key(id), foreign key(schemaid) references sqlserverschema(id));
create table sqlserverfield(id integer, tableid, name, datatype, datalength, primary key(id), foreign key(tableid) references sqlservertable(id), foreign key(datatype) references datatype(name));
Loading data representing the first tree from the question:
insert into connection(hostname, databasetype) values ('remote:1234', 'mysql');
insert into mysqldatabase(connectionid, name) select id, 'sales' from connection where hostname='remote:1234';
insert into mysqltable(databaseid, name) select id, 'user' from mysqldatabase where name='sales';
insert into mysqlfield(tableid, name, datatype, datalength) select id, 'name', 'str', 80 from mysqltable where name='user';
insert into mysqlfield(tableid, name, datatype) select id, 'age', 'int' from mysqltable where name='user';
Trying invalid manipulations of data:
insert into mysqlfield(tableid, name, datatype) values (2, 'newfield', 'qubit');
-- Error: FOREIGN KEY constraint failed
In order to pretty-print the whole tree, it is necessary to manually join all the tables involved.
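As a rough sketch (not part of the original answer), such a join for the MySQL branch alone could look like this; each source type would need its own query, combined with UNION if a single listing is wanted:
-- Sketch: list the MySQL portion of the tree with explicit joins
select c.hostname, d.name as database_name, t.name as table_name,
       f.name as field_name, f.datatype, f.datalength
from connection c
join mysqldatabase d on d.connectionid = c.id
join mysqltable t on t.databaseid = d.id
join mysqlfield f on f.tableid = t.id
order by c.hostname, d.name, t.name, f.name;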
Graph like implementation - one table representing the tree, other the hierarchy (constrained by triggers)
Here the element table is used to represent each element/node in the tree. Its level column explicitly classifies each element as a database, table, etc. Here SQLite's rowid is being used as the primary key, but it is easy to change it to a regular id.
In the previous implementation, foreign keys were used to ensure model correctness. Now triggers are used for this job. They decide which parent level accepts which child level, as allowed for the respective dbtype - those rules are specified in the element_type table.
Lastly, an extra table, element_property, is used to allow extra properties, such as the field type, to be attached to any element.
create table db_type(name primary key) without rowid;
insert into db_type values ('mysql'),('elasticsearch'),('mongo'),('sqlserver');
create table element_type(parentlevel, childlevel, dbtype, primary key(parentlevel, childlevel, dbtype), foreign key(dbtype) references db_type(name)); --not using without rowid to be able to have null parent level
insert into element_type values
(null, 'connection', 'mysql'),
('connection', 'database', 'mysql'),
('database', 'table', 'mysql'),
('table', 'field', 'mysql'),
(null, 'connection', 'elasticsearch'),
('connection', 'index', 'elasticsearch'),
('index','field', 'elasticsearch'),
(null, 'connection', 'mongo'),
('connection', 'database', 'mongo'),
('database', 'collection', 'mongo'),
('collection', 'field', 'mongo'),
(null, 'connection', 'sqlserver'),
('connection', 'database', 'sqlserver'),
('database', 'schema', 'sqlserver'),
('schema', 'table', 'sqlserver'),
('table', 'field', 'sqlserver');
create table element(id integer, parentid, name, level, dbtype, primary key(id), foreign key(parentid) references element(id), foreign key(dbtype) references db_type(name));
create table element_property(parentid, name, value, primary key(parentid, name), foreign key(parentid) references element(id)) without rowid;
-- trigger to guarantee that new elements will conform hierarchy
create trigger element_insert before insert on element
begin
select iif(count(*)>0, 'ok', raise(abort,'invalid parent-child insertion'))
from element_type etc
join element_type etp on (etp.childlevel, etp.dbtype)=(etc.parentlevel, etc.dbtype)
where (etc.dbtype, etc.parentlevel, etc.childlevel)
  =(new.dbtype, (select level from element ei where ei.rowid=new.parentid), new.level);
end;
-- trigger to guarantee that updated elements will conform hierarchy
create trigger element_update before update on element
begin
select iif(count(*)>0, 'ok', raise(abort,'invalid parent-child update'))
from element_type etc
join element_type etp on (etp.childlevel, etp.dbtype)=(etc.parentlevel, etc.dbtype)
where (etc.dbtype, etc.parentlevel, etc.childlevel)
  =(new.dbtype, (select level from element ei where ei.rowid=new.parentid), new.level);
end;
-- trigger to guarantee that hierarchy removal must respect existing elements (no delete cascade used)
create trigger element_type_delete before delete on element_type
begin
select iif(count(*)>0, raise(abort,'can''t remove, entries found in the element table using this relationship'), 'ok')
from element etc
join element etp on etp.rowid=etc.parentid and etp.dbtype=etc.dbtype
where etc.dbtype=old.dbtype and (etp.level,etc.level)=(old.parentlevel, old.childlevel);
end;
-- trigger to guarantee that hierarchy changes must respect existing elements
create trigger element_type_update before update on element_type
begin
select iif(count(*)>0, raise(abort,'can''t change, entries found in the element table using this relationship'), 'ok')
from element etc
join element etp on etp.rowid=etc.parentid and etp.dbtype=etc.dbtype
where etc.dbtype=old.dbtype and (etp.level,etc.level)=(old.parentlevel, old.childlevel)
  and (etp.level,etc.level)!=(new.parentlevel, new.childlevel);
end;
Loading data representing the first tree from the question:
insert into element(name, level, dbtype) values ('remote:1234', 'connection', 'mysql');
insert into element(name, level, dbtype, parentid) values ('sales', 'database', 'mysql', (select id from element where (level, name, dbtype)=('connection', 'remote:1234', 'mysql')));
insert into element(name, level, dbtype, parentid) values ('user', 'table', 'mysql', (select id from element where (level, name, dbtype)=('database', 'sales', 'mysql')));
insert into element(name, level, dbtype, parentid) values ('name', 'field', 'mysql', (select id from element where (level, name, dbtype)=('table', 'user', 'mysql')));
insert into element(name, level, dbtype, parentid) values ('age', 'field', 'mysql', (select id from element where (level, name, dbtype)=('table', 'user', 'mysql')));
insert into element_property(name, value, parentid) values ('fieldtype', 'varchar', (select id from element where (level, name, dbtype)=('field', 'name', 'mysql')));
insert into element_property(name, value, parentid) values ('fieldlength', 80, (select id from element where (level, name, dbtype)=('field', 'name', 'mysql')));
insert into element_property(name, value, parentid) values ('fieldtype', 'integer', (select id from element where (level, name, dbtype)=('field', 'age', 'mysql')));
Trying invalid manipulations of data:
insert into element(name, level, dbtype, parentid) values ('documents', 'collection', 'mysql', (select id from element where (level, name, dbtype)=('database', 'sales', 'mysql')));
-- Error: invalid parent-child insertion
update element_type set childlevel='specialfield' where dbtype='mysql' and (parentlevel, childlevel)=('table','field');
-- Error: can't change, entries found in the element table using this relationship
Pretty-printing the tree:
create view elementree(path) as
with recursive cte(id, name, depth, dbtype, level) as (
select id, name, 0 as depth, dbtype, level from element where parentid is null
union all
select el.id, el.name, cte.depth+1 as depth, el.dbtype, el.level from element el join cte on el.parentid=cte.id
order by depth desc
)
select substring(' ',0,2*depth)||name||' ('||dbtype||'-'||level||')' from cte;
select * from elementree;
-- remote:1234 (mysql-connection)
-- sales (mysql-database)
-- user (mysql-table)
-- documents (mysql-table)
-- name (mysql-field)
-- age (mysql-field)
Minimalist DRY graph like implementation - one table with only names representing the tree and only one auxiliary table
Here again an element table is used to represent each element in the tree. Differently from the previous case, the table has less information, and the type of each element - whether it is a database or a table - is implicitly inferred instead of explicitly determined by a column. By simply adding a user as a child of sales, it is inferred that user is a mysql table, since it is a child of a mysql database - sales, which is a database because it is a child of a mysql connection, which is a child of the mysql root element. Dbtypes are root elements in this tree; all their children are inferred to be of that dbtype.
Here the hierarchypath table is used to declare the hierarchy that has to be followed in the element tree. For the user's comfort, they only have to insert a ('>'-separated) string representing the hierarchy path, starting from the dbtype. The hierarchy view will deconstruct this string into the hierarchy structure. One example of a hierarchy path would be: mysql>connection>database>table>field.
Note that again, SQLite's rowid is used as the table id. Remember that it is not possible to see rowid with a simple select * from table; - it is hidden by default, so you need to select it explicitly: select rowid,* from table;.
create table element(name, parentrowid, foreign key(parentrowid) references element(rowid));
-- dbtypes are the root elements
insert into element(name) values ('mysql'),('elasticsearch'),('mongo'),('sqlserver');
create table hierarchypath(path);
insert into hierarchypath values
('mysql>connection>database>table>field'),
('elasticsearch>connection>index>field'),
('mongo>connection>database>collection>field'),
('sqlserver>connection>database>schema>table>field');
Loading data:
insert into element select 'remote:1234',rowid from element where (name,coalesce(parentrowid,-1))=('mysql',-1); --returning rowid; -- returning only works for sqlite 3.35+
insert into element select 'sales',rowid from element where rowid=5; -- rowid 5 = the 'remote:1234' connection
insert into element select 'user',rowid from element where rowid=6; -- rowid 6 = 'sales'
insert into element select 'name',rowid from element where rowid=7; -- rowid 7 = 'user'
insert into element select 'age',rowid from element where rowid=7; -- rowid 7 = 'user'
Pretty-printing:
create view hierarchy(root, depth, name) as
with recursive hierarchycte(root, depth, name, remaining) as (
select substr(path, 0, instr(path, '>')) as root, 0 as depth, substr(path, 0, instr(path, '>')) as name, substr(path, instr(path, '>')+1)||'>' as remaining from hierarchypath
union all
select root, depth+1 as depth, substr(remaining, 0, instr(remaining, '>')) as name, substr(remaining, instr(remaining, '>')+1) as remaining from hierarchycte where instr(remaining, '>') > 0
)
select root, depth, name from hierarchycte where depth>=0;
create view elementhierarchy(root, depth, name) as
with recursive elementcte(root, depth, name, rowid, parentrowid) as (
select name as root, 0 as depth, name, rowid, parentrowid from element where parentrowid is null
union all
select elcte.root, elcte.depth+1, el.name, el.rowid, el.parentrowid from elementcte elcte join element el on el.parentrowid=elcte.rowid
order by depth desc
)
select root, depth, name from elementcte;
create view elementree as
with recursive elementcte(root, depth, name, rowid, parentrowid) as (
select name as root, 0 as depth, name, rowid, parentrowid from element where parentrowid is null
union all
select elcte.root, elcte.depth+1, el.name, el.rowid, el.parentrowid from elementcte elcte join element el on el.parentrowid=elcte.rowid
order by depth desc
)
select substring(' ',0,2*h.depth-2)||eh.name||' ('||h.root||'-'||h.name||')' from (select *,row_number() over () as originalorder from elementhierarchy) eh join hierarchy h on (eh.root,eh.depth)=(h.root,h.depth) where h.depth>0 order by originalorder;
select * from elementree;
-- remote:1234 (mysql-connection)
-- sales (mysql-database)
-- user (mysql-table)
-- age (mysql-field)
-- name (mysql-field)
Triggers were not implemented here, but it would be good to do so. One example would be to avoid inserting more levels than allowed.
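As a rough sketch (not in the original answer), the check such a trigger would need can be expressed as a recursive query that walks up from the prospective parent and compares against the deepest level declared for that root; :parent_rowid is a placeholder, and a before-insert trigger (or the application) could run this and abort when new_depth exceeds allowed_depth:
-- Sketch: depth the new child would get, versus the deepest declared level for its root
with recursive ancestors(rowid, parentrowid, name, depth) as (
  select rowid, parentrowid, name, 0 from element where rowid = :parent_rowid
  union all
  select e.rowid, e.parentrowid, e.name, a.depth + 1
  from element e join ancestors a on e.rowid = a.parentrowid
)
select
  (select max(depth) from ancestors) + 1 as new_depth,
  (select max(depth) from hierarchy
    where root = (select name from ancestors where parentrowid is null)) as allowed_depth;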
It would be wiser to store the hierarchy in the deconstructed form seen in the hierarchy view, by doing the deconstruction at insertion time instead of on every select query, to avoid CPU consumption. Here it was left this way to differentiate it more from the other implementations.
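A minimal sketch of that deconstructed storage (the table and column names are assumptions) might be:
-- Sketch: store one row per (root, depth) instead of a '>'-separated string
create table hierarchylevel(root, depth, name, primary key(root, depth)) without rowid;
insert into hierarchylevel values
  ('mysql', 1, 'connection'), ('mysql', 2, 'database'),
  ('mysql', 3, 'table'),      ('mysql', 4, 'field');
-- The other dbtypes would be inserted the same way, and the hierarchy view
-- would then become a plain select over this table.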
Here the last-level entity, the field, has no properties as shown in the previous implementations. In this model it would be necessary to add one or two extra levels to the hierarchy: ...table>field>fieldpropertyandvalue or ...table>field>fieldproperty>fieldpropertyvalue. In the first case an example of fieldpropertyandvalue would be datatype=integer, and an example of the separated property and value would be, respectively, datatype and integer. This approach, where properties are new nodes in the graph, is closer to the approach used by RDF stores.
To conclude, it must be stated that it would be possible to use specialized graph databases, with their own query languages such as Cypher in Neo4j or SPARQL in others, or even other languages, but since the overall graph design is simple, a relational database suffices for our needs.

Related

Dynamic schema changes in Cassandra

I have lots of users (150-200 million). Each user has N (30-100) attributes. The attribute can be of type integer, text or timestamp. Attributes are not known in advance, so I want to add them dynamically, on the fly.
Solution 1 - Add new column by altering the table
CREATE TABLE USER_PROFILE(
UID uuid PRIMARY KEY,
LAST_UPDATE_DATE TIMESTAMP,
CREATION_DATE TIMESTAMP
);
For each new attribute:
ALTER TABLE USER_PROFILE ADD AGE INT;
INSERT INTO USER_PROFILE (UID, LAST_UPDATE_DATE, CREATION_DATE, AGE) VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121', 27);
Solution 2 - Fixed schema:
CREATE TABLE USER_PROFILE(
UID uuid,
ATTRIBUTE_NAME TEXT,
ATTRIBUTE_VALUE_TEXT TEXT,
ATTRIBUTE_VALUE_TIMESTAMP TIMESTAMP,
ATTRIBUTE_VALUE_INT INT,
LAST_UPDATE_DATE TIMESTAMP,
CREATION_DATE TIMESTAMP,
PRIMARY KEY (UID, ATTRIBUTE_NAME)
);
For each new attribute:
INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_INT, LAST_UPDATE_DATE, CREATION_DATE) VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 27, '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121');
Which is the best solution in terms of performance?
I would personally go with the 2nd solution - having columns for each data type that is used, and using the attribute name as the last component of the primary key; see examples in my previous answers on that topic:
Cassandra dynamic column family
How to handle Dynamic columns in Cassandra
How to handle Dynamic columns in Cassandra
How to understand the 'Flexible schema' in Cassandra?
First solution has following problems:
If you do schema modification from the code, then you need to coordinate these changes, otherwise you will get schema disagreement that must be resolved by admins by restarting the nodes. And a coordinated change will either slow down the data insertion, or create a single point of failure
Existence of many columns has significant performance impact. For example, per this very good analysis by The Last Pickle, having 100 columns instead of 10 increases read latency more than 10 times
You can't change the attribute type if you need to - in the solution with the attribute as a clustering column, you can just start writing the attribute value as another type (into the corresponding typed column). If you have the attribute as a column, you can't do that, because Cassandra doesn't allow changing a column's type (don't try to drop the column & add it back with the new type - you'll corrupt your existing data). So you would need to create a completely new column for that attribute.
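For completeness, a sketch of how reads and type changes look with the second solution (column and table names as in the question; these statements are illustrative, not from the original answer):
-- Read one attribute, or all attributes, for a user
SELECT ATTRIBUTE_VALUE_INT FROM USER_PROFILE
WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28 AND ATTRIBUTE_NAME = 'age';
SELECT * FROM USER_PROFILE
WHERE UID = 01f63e8b-db53-44ef-924e-7a3ccfaeec28;
-- "Changing" an attribute's type is just writing to a different value column
INSERT INTO USER_PROFILE (UID, ATTRIBUTE_NAME, ATTRIBUTE_VALUE_TEXT, LAST_UPDATE_DATE, CREATION_DATE)
VALUES (01f63e8b-db53-44ef-924e-7a3ccfaeec28, 'age', 'twenty-seven', '2021-01-12 07:34:19.121', '2021-01-12 07:34:19.121');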

How to implement relationships with inherits at the parent table to the children tables

I'm trying to implement a database that has inheritance between some tables; there are three tables involved in the problem: Customers, Users and Addresses (actually there are more tables involved, but with the same problem, so..).
The Customers table inherits from the Users table, and the Users table has a relationship with the Addresses table (1 to many, respectively).
So my problem is that I want the 'Customers' table to have the same relationship that 'Users' has with 'Addresses', because Customers inherits from it. I also tried to insert data into 'Addresses' with an ID from 'Customers', but this gives a foreign key constraint violation: the value doesn't exist in table "myDb.users".
Here is an image of my modeling:
(Note: I'm actually using PostgreSQL; I'm just using ADO.NET for the modeling, and I know a way to get around this, but if there is no way to do it with inheritance I will change the entire DB to a fully relational database.)
I assume that you're using PostgreSQL table inheritance which, unfortunately, doesn't work quite as we would expect. In particular, although records from child tables appear in selects from the parent table, they are not physically stored there, and thus their ids can't be used in foreign keys referencing the parent table.
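A minimal reproduction of that behaviour, assuming simplified table definitions (not taken from your model), looks like this:
CREATE TABLE users (id int PRIMARY KEY, user_property int);
CREATE TABLE customers (customer_property int) INHERITS (users);
CREATE TABLE addresses (user_id int REFERENCES users (id), address text);
INSERT INTO customers (id, user_property, customer_property) VALUES (1, 10, 20);
SELECT * FROM users;                 -- the row with id = 1 is visible here...
INSERT INTO addresses (user_id, address) VALUES (1, 'Wall Street');
-- ...but this fails with a foreign key violation, because the row is physically
-- stored in "customers", and the FK on "users" does not see it.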
You may consider implementing inheritance using classic approach:
CREATE TABLE Users(id INT PRIMARY KEY, user_property INT);
CREATE TABLE Customers(id INT PRIMARY KEY REFERENCES Users, customer_property INT);
CREATE TABLE Addresses(user_id INT REFERENCES Users, address TEXT);
This way you physically store properties of Customer in two tables, and you are sure that for every Customer there is a record in Users table which can be referenced from other tables.
-- inserting customer with id=1, user_property=10, customer_property=20
INSERT INTO Users(id, user_property) VALUES (1, 10);
INSERT INTO Customers(id, customer_property) VALUES (1, 20);
-- Inserting address
INSERT INTO Addresses(user_id, address) VALUES (1, 'Wall Street');
The drawback is that you need to join Users and Customers if you want to get all properties of a single customer from both tables:
-- All customer properties
SELECT * FROM Customers JOIN Users USING(id) WHERE Customers.id=1;
-- Customer and address
SELECT * FROM Customers JOIN Users USING(id) JOIN Addresses ON Users.id=Addresses.user_id WHERE Customers.id=1;

Postgres INSERT INTO... SELECT violates foreign key constraint

I'm having a really, really strange issue with postgres. I'm trying to generate GUIDs for business objects in my database, and I'm using a new schema for this. I've done this with several business objects already; the code I'm using here has been tested and has worked in other scenarios.
Here's the definition for the new table:
CREATE TABLE guid.public_obj
(
guid uuid NOT NULL DEFAULT uuid_generate_v4(),
id integer NOT NULL,
CONSTRAINT obj_guid_pkey PRIMARY KEY (guid),
CONSTRAINT obj_id_fkey FOREIGN KEY (id)
REFERENCES obj (obj_id)
ON UPDATE CASCADE ON DELETE CASCADE
)
However when I try to backfill this using the following code, I get a SQL state 23503 claiming that I'm violating the foreign key constraint.
INSERT INTO guid.public_obj (guid, id)
SELECT uuid_generate_v4(), o.obj_id
FROM obj o;
ERROR: insert or update on table "public_obj" violates foreign key constraint "obj_id_fkey"
SQL state: 23503
Detail: Key (id)=(-2) is not present in table "obj".
However, if I do a SELECT on the source table, the value is definitely present:
SELECT uuid_generate_v4(), o.obj_id
FROM obj o
WHERE obj_id = -2;
"0f218286-5b55-4836-8d70-54cfb117d836";-2
I'm baffled as to why postgres might think I'm violating the fkey constraint when I'm pulling the value directly out of the corresponding table. The only constraint on obj_id in the source table definition is that it's the primary key. It's defined as a serial; the select returns it as an integer. Please help!
Okay, apparently the reason this is failing is because unbeknownst to me the table (which, I stress, does not contain many elements) is partitioned. If I do a SELECT COUNT(*) FROM obj; it returns 348, but if I do a SELECT COUNT(*) FROM ONLY obj; it returns 44. Thus, there are two problems: first, some of the data in the table has not been partitioned correctly (there exists unpartitioned data in the parent table), and second, the data I'm interested in is split out across multiple child tables and the fkey constraint on the parent table fails because the data isn't actually in the parent table. (As a note, this is not my architecture; I'm having to work with something that's been around for quite some time.)
The partitioning is by implicit type (there are three partitions, each of which contains rows relating to a specific subtype of obj) and I think the eventual solution is going to be creating GUID tables for each of the subtypes. I'm going to have to handle the stuff that's actually in the obj table probably by selecting it into a temp table, dropping the rows from the obj table, then reinserting them so that they can be partitioned properly.
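A rough sketch of that clean-up (illustrative only; the actual routing of rows into the child tables depends on the existing partition triggers/rules, which aren't shown in the question):
BEGIN;
-- Rows physically stored in the parent table (the "unpartitioned" data)
CREATE TEMP TABLE obj_unpartitioned AS SELECT * FROM ONLY obj;
DELETE FROM ONLY obj;
-- Re-insert so the partitioning logic can place them in the proper child tables
INSERT INTO obj SELECT * FROM obj_unpartitioned;
COMMIT;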

Cascade UPDATE to related objects

I've set up my database and application to soft delete rows. Every table has an is_active column where the values should be either TRUE or NULL. The problem I have right now is that my data is out of sync because unlike a DELETE statement, setting a value to NULL doesn't cascade to rows in separate tables for which the "deleted" row in another table is a foreign key.
I have already taken measures to correct the data by finding inactive rows from the source table and manually setting related rows in other tables to be inactive as well. I recognize that I could do this at the application level (I'm using Django/Python for this project), but I feel like this should be a database process. Is there a way to utilize something like PostgreSQL's ON UPDATE constraint so that when a row has is_active set to NULL, all rows in separate tables referencing the updated row as a foreign key automatically have is_active set to NULL as well?
Here's an example:
An assessment has many submissions. If the assessment is marked inactive, all submissions related to it should also be marked inactive.
To my mind, it doesn't make sense to use NULL to represent a Boolean value. The semantics of "is_active" suggest that the only sensible values are True and False. Also, NULL interferes with cascading updates.
So I'm not using NULL.
First, create the "parent" table with both a primary key and a unique constraint on the primary key and "is_active".
create table parent (
p_id integer primary key,
other_columns char(1) default 'x',
is_active boolean not null default true,
unique (p_id, is_active)
);
insert into parent (p_id) values
(1), (2), (3);
Create the child table with an "is_active" column. Declare a foreign key constraint referencing the columns in the parent table's unique constraint (last line in the CREATE TABLE statement above), and cascade updates.
create table child (
p_id integer not null,
is_active boolean not null default true,
foreign key (p_id, is_active) references parent (p_id, is_active)
on update cascade,
some_other_key_col char(1) not null default '!',
primary key (p_id, some_other_key_col)
);
insert into child (p_id, some_other_key_col) values
(1, 'a'), (1, 'b'), (2, 'a'), (2, 'c'), (2, 'd'), (3, '!');
Now you can set the "parent" to false, and that will cascade to all referencing tables.
update parent
set is_active = false
where p_id = 1;
select *
from child
order by p_id;
p_id is_active some_other_key_col
--
1 f a
1 f b
2 t a
2 t c
2 t d
3 t !
Soft deletes are a lot simpler and have much better semantics if you implement them as valid-time state tables. FWIW, I think the terms soft delete, undelete, and undo are all misleading in this context, and I think you should avoid them.
PostgreSQL's range data types are particularly useful for this kind of work. I'm using date ranges, but timestamp ranges work the same way.
For this example, I'm treating only "parent" as a valid-time state table. That means that invalidating a particular row (soft deleting a particular row) also invalidates all the rows that reference it through foreign keys. It doesn't matter whether they reference it directly or indirectly.
I'm not implementing soft deletes on "child". I can do that, but I think that would make the essential technique unreasonably hard to understand.
create extension btree_gist; -- Necessary for the kind of exclusion
-- constraint below.
create table parent (
p_id integer not null,
other_columns char(1) not null default 'x',
valid_from_to daterange not null,
primary key (p_id, valid_from_to),
-- No overlapping date ranges for a given value of p_id.
exclude using gist (p_id with =, valid_from_to with &&)
);
create table child (
p_id integer not null,
valid_from_to daterange not null,
foreign key (p_id, valid_from_to) references parent on update cascade,
other_key_columns char(1) not null default 'x',
primary key (p_id, valid_from_to, other_key_columns),
other_columns char(1) not null default 'x'
);
Insert some sample data. In PostgreSQL, the daterange data type has a special value 'infinity'. In this context, it means that the row that has the value 1 for "parent"."p_id" is valid from '2015-01-01' until forever.
insert into parent values
(1, 'x', daterange('2015-01-01', 'infinity'));
insert into child values
(1, daterange('2015-01-01', 'infinity'), 'a', 'x'),
(1, daterange('2015-01-01', 'infinity'), 'b', 'y');
This query will show you the joined rows.
select *
from parent p
left join child c
on p.p_id = c.p_id
and p.valid_from_to = c.valid_from_to;
To invalidate a row, update the date range. This row (below) was valid from '2015-01-01' to '2015-01-31'. That is, it was soft deleted on 2015-01-31.
update parent
set valid_from_to = daterange('2015-01-01', '2015-01-31')
where p_id = 1 and valid_from_to = daterange('2015-01-01', 'infinity');
Insert a new valid row for p_id 1, and pick up the child rows that were invalidated on Jan 31.
insert into parent values (1, 'r', daterange(current_date, 'infinity'));
update child set valid_from_to = daterange(current_date, 'infinity')
where p_id = 1 and valid_from_to = daterange('2015-01-01', '2015-01-31');
Richard T Snodgrass's seminal book Developing Time-Oriented Database Applications in SQL is available free from his university web page.
You can use a trigger:
CREATE OR REPLACE FUNCTION trg_upaft_upd_trip()
RETURNS TRIGGER AS
$func$
BEGIN
UPDATE submission s
SET is_active = NULL
WHERE s.assessment_id = NEW.assessment_id
AND NEW.is_active IS NULL; -- recheck to be sure
RETURN NEW; -- call this BEFORE UPDATE
END
$func$ LANGUAGE plpgsql;
CREATE TRIGGER upaft_upd_trip
BEFORE UPDATE ON assessment
FOR EACH ROW
WHEN (OLD.is_active AND NEW.is_active IS NULL)
EXECUTE PROCEDURE trg_upaft_upd_trip();
Related:
How do I make a trigger to update a column in another table?
Be aware that a trigger has more possible points of failure than FK constraints with ON UPDATE CASCADE ON DELETE CASCADE.
@Mike added a solution with a multi-column FK constraint that I would consider as an alternative.
Related answer on dba.SE:
Enforcing constraints “two tables away”
Related answer one week later:
Cross table constraints in PostgreSQL
This is more a schematic problem than a procedural one.
You may have dodged creating a solid definition of "what constitutes a record". At the moment you have object A that may be referenced by object B, and when A is "deleted" (has its is_active column set to FALSE, or NULL, in your current case) B is not reflecting that. It sounds like this is a single table (you only mention rows, not separate classes or tables...) and you have a hierarchical model formed by self-reference. If that is the case you can think of the problem in a few ways:
Recursive lineage
In this model you have one table that contains all the data in one place, whether it's a parent, a child, etc., and you check the table for recursive references to traverse the tree.
It is tricky to do this properly in an ORM that lacks explicit support for this without accidentally writing routines that either:
iteratively pound the crap out of your DB by making at least one query per node, or
pull the entire table at once and traverse it in application code
It is, however, straightforward to do this in Postgres and let Django access it via a model over an unmanaged view on the lineage query you build. (I wrote a little about this once.) Under this model your query will descend the tree until it hits the first row of the current branch that is marked as not active and stop, thus effectively truncating all the rows below associated with that one (no need for propagating the is_active column!).
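As a sketch of what that lineage query can look like in Postgres (table and column names here are assumptions, and roots are taken to be rows that are their own parent, as in the blog example below), the recursion simply refuses to descend past an inactive row, so everything beneath it disappears from the result:
-- Hypothetical self-referencing table: node(id, parent_id, body, is_active)
WITH RECURSIVE lineage AS (
    SELECT n.id, n.parent_id, n.body, 1 AS depth
    FROM node n
    WHERE n.id = n.parent_id AND n.is_active      -- roots are their own parent
    UNION ALL
    SELECT c.id, c.parent_id, c.body, l.depth + 1
    FROM node c
    JOIN lineage l ON c.parent_id = l.id AND c.id <> c.parent_id
    WHERE c.is_active                             -- stop at the first inactive row
)
SELECT * FROM lineage;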
If this were, say, a blog entry + comments within the same structure (a fairly common CMS schema) then any row that is its own parent is a primary entity and anything that has a parent that is not itself is a comment. To remove a whole blog post + its children you mark just the blog post's row as inactive; to remove a thread within the comments mark as inactive the comment that begins that thread.
For a blog + comments type feature this is usually the most straightforward way to do things -- though most CMS systems get it wrong (but usually only in ways that matter if you start doing serious data stuff later, if you're just setting up some place for people to argue on the internet then Worse is Better).
Recursive lineage + External "record" definition
In this model you have your tree of nodes and your primary entities separated. The primary entities are marked as being active or not, and that attribute is common to all the elements that are related to it within the context of that primary entity (they exist and have a meaning independent of it). This means two tables, one for primary entities, and one for your tree of nodes.
Use this when you have something more interesting going on than simply threaded discussion. For example, a model of components where a tree of things may be aggregated separately into other larger things, and you need to have a way to mark those "other larger things" as active or not independently of the components themselves.
Further down the rabbit hole...
There are other takes on this idea, but they get increasingly non-trivial, which is probably not suitable. For example, consider a third basic take on this model where the hierarchy structure, the node bodies, and the primary entities are all separated into different tables. One node body might appear in multiple trees by reference, and multiple trees may be considered active or inactive in the context of a single primary entity, etc.
Consider heading this direction if your data is more complex. If you wind up really needing models this far decomposed ("normalized") then I would caution that any ORM is probably going to wind up being a lot more trouble than its worth -- you will start running headlong into the problem that ORMs are fundamentally leaky abstractions (1 object can never really equate to 1 table...).

How to store the following SQL data optimally in SQL Server 2008

I am creating a page where people can post articles. When the user posts an article, it shows up on a list, like the related questions on Stack Overflow (when you add a new question). It's fairly simple.
My problem is that I have 2 types of users. 1) Unregistered private users. 2) A company.
The unregistered users needs to type in their name, email and phone. Whereas the company users just needs to type in their company name/password. Fairly simple.
I need to reduce the excess database usage and try to optimize the database and build the tables effectively.
Now to my problem in hand:
So I have one table with the information about the companies, ID (guid), Name, email, phone etc.
I was thinking about making one table called articles that contained ArticleID, Headline, Content and Publishing date.
One table with the information about the unregistered users, ID, their name, email and phone.
How do I tie the articles table to the company/unregistered users table? Is it good to make an integer column that contains 2 values, 1=unregistered user and 2=company, and then one field with an ID number pointing to the specified user/company? It looks like you need a lot of extra code to query the database. Performance? How could I then return the article along with the contact information? You should also be able to return all the articles from a specific company.
So Table company would be:
ID (guid), company name, phone, email, password, street, zip, country, state, www, description, contact person and a few more that I don't have here right now.
Table Unregistered user:
ID (guid), name, phone, email
Table article:
ID (int/guid/short guid), headline, content, published date, is_company, id_to_user
Is there a better approach?
Qualities that I am looking for are: performance, easy to query, and easy to maintain (adding new fields, indexes, etc.).
Theory
The problem you described is called Table Inheritance in data modeling theory. In Martin Fowler's book the solutions are:
single table inheritance: a single table that contains all fields.
class table inheritance: one table per class, with table for abstract classes.
concrete table inheritance: one table per non-abstract class, abstract members are repeated in each concrete table
So from a theory and industry practice point of view all three solutions are acceptable: one table Posters with NULLable columns (ie. single table), three tables Posters, Companies and Persons (ie. class inheritance) and two tables Companies and Persons (ie. concrete inheritance).
Now, to pros and cons.
Cost of NULL columns
The record structure is discussed in Inside the Storage Engine: Anatomy of a record:
NULL bitmap:
two bytes for the count of columns in the record
a variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000, which had one bit per nullable column only)
So if you have at least one NULLable column, you pay the cost of the NULL bitmap in each record, at least 3 bytes. But the cost is identical whether you have 1 or 8 columns! The 9th column will add a byte to the NULL bitmap in each record. The formula is described in Estimating the Size of a Clustered Index: 2 + ((Num_Cols + 7) / 8)
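As a worked example of that formula: a record with 10 columns carries 2 + ((10 + 7) / 8) = 4 bytes of NULL bitmap, while a record with 100 columns carries 2 + ((100 + 7) / 8) = 15 bytes (integer division), so the bitmap itself stays small even for wide tables.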
Performance Driving Factor
In a database system there is really only one factor that drives performance: the amount of data scanned. How large are the records scanned by a query plan, and how many records does it have to scan? So to improve the performance you need to:
narrow the records: reduce the data size, covering include indexes, vertical partitioning
reduce the number of records scanned: indexes
reduce the number of scans: eliminate joins
Now in order to analyze these criteria, there is something missing in your post: the prevalent data access pattern, ie. the most common query that the database will be hit with. This is driven by how you display your posts on the site. Consider these possible approaches:
posts front page: like SO, a page of recent posts with header, excerpt, time posted and author basic information (name, gravatar). To get this page displayed you need to join Posts with authors, but you only need the author name and gravatar. Both single table inheritance and class table inheritance would work, but concrete table inheritance would fail. This is because you cannot afford for such a query to do conditional joins (ie. join the articles posted to either Companies or Persons); such a query would be less than optimal.
posts per author: users have to login first and then they'll see their own posts (this is common for non-public post oriented sites, think incident tracking for instance). For such a design, all three table inheritance schemes would work.
Conclusion
There are some general performance considerations (ie. narrow the data) to consider, but the critical information is missing: how are you going to query the data, your access pattern. The data model has to be optimized for that access pattern:
Which fields from Companies and Persons will be displayed on the landing page of the site (ie. the most frequent and performance-critical query)? You don't want to join 5 tables to show those fields.
Are some Company/Person information fields only needed on the user information page? Perhaps partition the table vertically into CompaniesExtra and PersonsExtra tables. Or use an index that will cover the frequently used fields (this approach simplifies code and is easier to keep consistent, at the cost of data duplication)
PS
Needless to say, don't use guids for ids. Unless you're building a distributed system, they are a horrible choice for reasons of excessive width. Fragmentation is also a potential problem, but that can be alleviated by use of sequential guids.
Ideally, if you could use an ORM (as mentioned by TFD), I would do so. Since you have not commented on that, and you always come back to the "performance" question, I assume you would not like to use one.
Using pure SQL, the approach I would suggest would be to have table structure as below:
ArticleOwner [ID (guid)]
Company [ID (guid) - PK as well as FK to ArticleOwner.ID,
company name, phone, email, password, street, zip, ...]
UnregisteredUser [ID (guid) - PK as well as FK to ArticleOwner.ID,
name, phone, email]
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ArticleOwner.ID]
Let's see the usages:
INSERT: the overhead is the need to add a row to the ArticleOwner table for each Company/UU. This is not an operation that happens often, so there is no need to optimize its performance
SELECT:
Company/UU: well, it is easy to search for both UU and Company, since you do not need to JOIN to any other table, as all the info about the required object is in one table
Articles of one Company/UU: again, you just need to filter on the GUID of the Company/UU, and there you go: SELECT (list fields) FROM Article WHERE ArticleOwnerID = #AOID
Also think that one day you might need to support multiple Owners of an Article. With the parent table approach above (or the one mentioned by Vincent) you will just need to introduce a relation table, whereas with the solution of 2 NULL-able FK constraints to each Owner table you are kind of stuck.
Performance:
Are you sure you have a performance problem? What is your target?
One thing I can recommend, looking at your model regarding performance, is not to use GUIDs as the clustered index (which is the default for a PK), because basically your INSERT statements will be inserting data randomly into the table.
Alternatives are:
use Sequential GUID instead (see: What are the performance improvement of Sequential Guid over standard Guid?)
use both INTEGER and GUID. This is a somewhat complicated approach and might be overkill for the simple model you have, but the result is that you always JOIN tables in SELECTs on INTEGER instead of GUID, which is much faster.
So if you are so hot on performance, you might try to do the following:
ArticleOwner (ID (int identity) - PK, UID (guid) - UC)
Company [ID (int) - PK as well as FK to ArticleOwner.ID,
UID (guid) - UC as well as FK to ArticleOwner.UID, company name, ...]
...
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ArticleOwner.ID (int)]
To INSERT a user (Company/UU) you do the following:
Having a UID (maybe a sequential one) from the code, you do an INSERT into the ArticleOwner table. You get back the autogenerated integer ID.
You insert all the data into Company/UU, including the integer ID that you have just received.
ArticleOwner.ID will be an integer, so searching on it will be faster than on UID, especially when you have an index on it.
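A minimal T-SQL sketch of those two steps (the table and column names follow the outline above and are assumptions, not an actual schema):
DECLARE @OwnerUid uniqueidentifier = NEWID();  -- in practice supplied by the code, possibly a sequential GUID
DECLARE @OwnerId int;
-- Step 1: insert into ArticleOwner, capture the autogenerated integer ID
INSERT INTO ArticleOwner (UID) VALUES (@OwnerUid);
SET @OwnerId = SCOPE_IDENTITY();
-- Step 2: insert the Company (or UnregisteredUser) row with the same IDs
INSERT INTO Company (ID, UID, CompanyName, Phone, Email)
VALUES (@OwnerId, @OwnerUid, 'Acme Ltd', '555-0100', 'info@acme.example');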
This is a common OO programming problem that should not be solved in the SQL domain. It should be handled by your ORM.
Make two classes in your program code as required and let your ORM map them to a suitable SQL representation. For performance, a single table with nulls will do; the only overhead is the discriminator column.
Some examples: hibernate inheritance
I would suggest the super-type Author for Person and Organization sub-types.
Note that AuthorID serves as the primary and the foreign key at the same time for Person and Organization tables.
So first let's create tables:
CREATE TABLE Author(
AuthorID integer IDENTITY NOT NULL
,AuthorType char(1)
,Phone varchar(20)
,Email varchar(128) NOT NULL
);
ALTER TABLE Author ADD CONSTRAINT pk_Author PRIMARY KEY (AuthorID);
CREATE TABLE Article (
ArticleID integer IDENTITY NOT NULL
,AuthorID integer NOT NULL
,DatePublished date
,Headline varchar(100)
,Content varchar(max)
);
ALTER TABLE Article ADD
CONSTRAINT pk_Article PRIMARY KEY (ArticleID)
,CONSTRAINT fk1_Article FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID) ;
CREATE TABLE Person (
AuthorID integer NOT NULL
,FirstName varchar(50)
,LastName varchar(50)
);
ALTER TABLE Person ADD
CONSTRAINT pk_Person PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Person FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
CREATE TABLE Organization (
AuthorID integer NOT NULL
,OrgName varchar(40)
,OrgPassword varchar(128)
,OrgCountry varchar(40)
,OrgState varchar(40)
,OrgZIP varchar(16)
,OrgContactName varchar(100)
);
ALTER TABLE Organization ADD
CONSTRAINT pk_Organization PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Organization FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
When inserting into Author you have to capture the auto-incremented id and then use it to insert the rest of data into person or organization, depending on AuthorType. Each row in Author has only one matching row in Person or Organization, not in both. Here is an example of how to capture the AuthorID.
-- Insert into table and return the auto-incremented AuthorID
INSERT INTO Author ( AuthorType, Phone, Email )
OUTPUT INSERTED.AuthorID
VALUES ( 'P', '789-789-7899', 'dudete#mmahoo.com' );
Here are a few examples of how to query authors:
-- Return all authors (org and person)
SELECT *
FROM dbo.Author AS a
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all-organization authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all person-authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
And now all articles with authors.
-- Return all articles with author information
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
There are two ways to return all articles belonging to organizations. The first example returns only columns from the Organization table, while the second one has columns from the Person table too, with NULL values.
-- (1) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;
-- (2) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE AuthorType = 'O';
And to return all articles belonging to a specific organization, again two methods.
-- (1) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
-- (2) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
To make queries simpler, you could package some of this into a view or two.
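For example, one such view (the name and column selection below are just a suggestion) could flatten the author subtypes, so article queries join it once instead of repeating the three-table join:
CREATE VIEW dbo.AuthorDetails AS
SELECT a.AuthorID, a.AuthorType, a.Phone, a.Email,
       p.FirstName, p.LastName,
       c.OrgName, c.OrgContactName
FROM dbo.Author AS a
LEFT JOIN dbo.Person AS p ON p.AuthorID = a.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;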
Just as a reminder, it is common for an article to have several authors, so a many-to-many table Article_Author would be in order.
My preference is to use a table that acts like a super table to both.
ArticleOwner = (ID (guid), company name, phone, email)
company = (ID, password)
unregistereduser = (ID)
article = (ID (int/guid/short guid), headline, content, published date, owner)
Then querying the database will require a JOIN on the 3 tables, but this way you do not have the null fields.
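A rough sketch of that join, using the table and column names from the outline above (illustrative only):
SELECT a.ID, a.headline, ao.name, ao.phone, ao.email,
       CASE WHEN c.ID IS NOT NULL THEN 'company' ELSE 'unregistered user' END AS owner_kind
FROM article a
JOIN articleowner ao ON ao.ID = a.owner
LEFT JOIN company c ON c.ID = ao.ID
LEFT JOIN unregistereduser u ON u.ID = ao.ID;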
I'd suggest, instead of two tables, creating one table Poster.
It's ok to have some fields empty if they are not applicable to one kind of poster.
Poster:
ID (guid), type, name, phone, email, password
where type is 1 for company, 2 - for unregistered user.
OR
Keep your users and companies separate, but require each company to have a user in users table. That table should have a CompanyID field. I think it would be more logical and elegant.
An interesting approach would be to use the Node model followed by Drupal, where everything is effectively a Node and all other data is stored in a secondary table. It's highly flexible, as is evidenced by the widespread use of Drupal in large publishing and discussion sites.
The layout would be something like this:
Node
    ID
    Type (User, Guest, Article)
    TypeID (PKey of related data)
    Created
    Modified
Article
    ID
    Field1
    Field2
    Etc.
User
    ID
    Field1
    Field2
    Etc.
Guest
    ID
    Field1
    Field2
    Etc.
It's an alternative option with some good benefits. The greatest being flexibility.
I'm not convinced you need to distinguish between companies and persons; only registered and unregistered authors.
I added this for clarity. You could simply use a check constraint on the Authors table to limit the values to U and R.
Create Table dbo.AuthorRegisteredStates
(
Code char(1) not null Primary Key Clustered
, Name nvarchar(15) not null
, Constraint UK_AuthorRegisteredState Unique ( [Name])
)
Insert dbo.AuthorRegisteredStates(Code, Name) Values('U', 'Unregistered')
Insert dbo.AuthorRegisteredStates(Code, Name) Values('R', 'Registered')
GO
The key in any database system is data integrity. So, we want to ensure that usernames are unique and, perhaps, that Names are unique. Do you want to allow two people with the same name to publish an article? How would the reader differentiate them? Notice that I don't care whether the Author represents a company or person. If someone is registering a company or a person, they can put in a first name and last name if they want. However, what is required is that everyone enter a name (think of it as a display name). We would never search for authors based on anything other than name.
Create Table dbo.Authors
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorStateCode char(1) not null
, Name nvarchar(100) not null
, Email nvarchar(300) null
, Username nvarchar(20) not null
, PasswordHash nvarchar(50) not null
, FirstName nvarchar(25) null
, LastName nvarchar(25) null
...
, Address nvarchar(max) null
, City nvarchar(40) null
...
, Website nvarchar(max) null
, Constraint UK_Authors_Name Unique ( [Name] )
, Constraint UK_Authors_Username Unique ( [Username] )
, Constraint FK_Authors_AuthorRegisteredStates
Foreign Key ( AuthorStateCode )
References dbo.AuthorRegisteredStates ( Code )
-- optional. if you really wanted to ensure that an author that was unregistered
-- had a firstname and lastname. However, I'd recommend enforcing this in the GUI
-- if anywhere as it really does not matter if they
-- enter a first name and last name.
-- All that matters is whether they are registered and entered a name.
, Constraint CK_Authors_RegisteredWithFirstNameLastName
Check ( AuthorStateCode = 'R' Or ( AuthorStateCode = 'U' And FirstName Is Not Null And LastName Is Not Null ) )
)
Can a single author publish two articles on the same date and time? If not (as I've guessed here), then we add a unique constraint. The question is whether you might need to identify an article. What information might you be given to locate an article besides the general date it was published?
Create Table dbo.Articles
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorId int not null
, PublishedDate datetime not null
, Headline nvarchar(200) not null
, Content nvarchar(max) null
...
, Constraint UK_Articles_PublishedDate Unique ( AuthorId, PublishedDate )
, Constraint FK_Articles_Authors
Foreign Key ( AuthorId )
References dbo.Authors ( Id )
)
In addition, I would add an index on PublishedDate to improve searches by date.
Create Index IX_Articles_PublishedDate On dbo.Articles ( PublishedDate )
I would also enable free text search to search on the contents of articles.
I think concerns about "empty space" are probably premature optimization. The effect on performance will be nil. This is a case where a small amount of denormalizing costs you nothing in terms of performance and gains you in terms of development. However, if it really concerned you, you could move the address information into a 1:1 table like so:
Create Table dbo.AuthorAddresses
(
AuthorId int not null Primary Key Clustered
, Street nvarchar(max) not null
, City nvarchar(40) not null
...
, Constraint FK_AuthorAddresses_Authors
Foreign Key ( AuthorId )
References dbo.Authors( Id )
)
This will add a small amount of complexity to your middle-tier. As always, the question is whether the elimination of some empty space exceeds the cost in terms of coding and testing. Whether you store this information as columns in your Authors table or in a separate table, the effect on performance will be nil.
I have solved similar problems by an approach similar to this:
Company -> CompanyArticles -> Articles
User -> UserArticles -> Articles
CompanyArticles contains a mapping from Company to an Article
UserArticles contains a mapping from User to Article
Article doesn't know anything about who created it.
By inverting the dependencies here you end up not overloading the meaning of foreign keys, having unused foreign keys, or creating a super table.
Getting all articles and contact information would look like:
SELECT name, phone, email FROM
user
JOIN userarticles on user.user_id = userarticles.user_id
JOIN articles on userarticles.article_id = articles.article_id
UNION
SELECT name, phone, email FROM
company
JOIN companyarticles on company.company_id = companyarticles.company_id
JOIN articles on companyarticles.article_id = articles.article_id
