I would like to know the advantages and performance impacts of transforming an IS_A hierarchy into relations. Is it better to keep the 3 tables, or to use separate tables for Faculty and Student only? Also, if (X,Y) is a key of a relation, can either of them be a superkey of the relation?
Person(Pid,name,age)
Faculty(Pid,rank)
Student(Pid,gpa)
Many times through the years it has occurred to me that best designs for prevailing theory and best designs for practical application are moving further and further apart. The design you show is a poor one in that it is susceptible to data anomalies. For example, nothing prevents the PID of a faculty member from being entered into the Student table and vice versa.
There must be a way to specify that a PID is that of faculty or student (or both if that is allowed). Then the Faculty and Student tables must be designed to adhere to that specification.
Fortunately, that is not difficult. Call it a hybrid intersection or cross table that comes between the main entity and the derived entities. This not only connects the derived entity to the main entity, but also defines the type of derivation. Here is a minimal definition:
create table FacultyOrStudent(
PID int not null references Person( PID ),
PersonType char( 1 ) not null check( PersonType in ( 'F', 'S' )),
constraint PK_FacultyOrStudent primary key( PID, PersonType )
);
There could well be other fields like a date that the person joined the faculty or student body.
The PK allows the same person to be both a faculty member and student. If that is not allowed, the PK would be the PID field alone. However, in that case, (PID, PersonType) would be defined as unique. I'll elaborate below.
Unlike a standard intersection table, the only foreign key is the PID back to the Person table or the main entity. It cannot also be a FK to the derived entity as that is defined in different tables. However, nothing prevents us from having it the target of a FK reference from those other tables. Thus the defining of (PID, PersonType) as either the PK or as unique.
Here, then, are the derived entities:
create table FacultyPerson(
PID int not null primary key,
FacultyType char( 1 ) not null check( FacultyType = 'F' ), -- not null: a NULL here would bypass both the check and the FK
Rank ranktype,
constraint FK_FacultyToDefinition foreign key( PID, FacultyType )
references FacultyOrStudent( PID, PersonType )
);
create table StudentPerson(
PID int not null primary key,
StudentType char( 1 ) not null check( StudentType = 'S' ), -- not null: a NULL here would bypass both the check and the FK
GPA gpatype,
constraint FK_StudentToDefinition foreign key( PID, StudentType )
references FacultyOrStudent( PID, PersonType )
);
The same PID cannot be used more than once as either a faculty member or student. Most importantly, it is not possible to add a PID to the FacultyPerson or StudentPerson tables that has not previously been defined as a faculty member or student, respectively, in the FacultyOrStudent table.
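To see the constraint at work, here is a minimal sketch that rebuilds the tables in an in-memory SQLite database driven from Python. SQLite and the Python wrapper are my assumptions, chosen only so the example can be run anywhere; the column types (text, real) stand in for ranktype and gpatype, the defaults on the type columns are a convenience I added, and the sample person is made up.

```python
import sqlite3

# Autocommit mode, so the PRAGMA below takes effect immediately.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
create table Person(
    PID  integer primary key,
    name text,
    age  integer
);
create table FacultyOrStudent(
    PID        integer not null references Person( PID ),
    PersonType text    not null check( PersonType in ( 'F', 'S' )),
    primary key( PID, PersonType )
);
create table FacultyPerson(
    PID         integer not null primary key,
    FacultyType text    not null default 'F' check( FacultyType = 'F' ),
    Rank        text,
    foreign key( PID, FacultyType )
        references FacultyOrStudent( PID, PersonType )
);
create table StudentPerson(
    PID         integer not null primary key,
    StudentType text    not null default 'S' check( StudentType = 'S' ),
    GPA         real,
    foreign key( PID, StudentType )
        references FacultyOrStudent( PID, PersonType )
);
""")

def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

# Person 1 is registered as faculty only.
conn.execute("insert into Person values( 1, 'Ada', 40 )")
conn.execute("insert into FacultyOrStudent values( 1, 'F' )")

as_faculty = attempt("insert into FacultyPerson( PID, Rank ) values( 1, 'Professor' )")
as_student = attempt("insert into StudentPerson( PID, GPA ) values( 1, 3.9 )")
print(as_faculty, as_student)
```

The student insert fails because there is no (1, 'S') row in FacultyOrStudent for the composite foreign key to match.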
To make the work of the application developers easier (and because my personal rule is not to allow apps direct access to tables), create two views which provide all the faculty data and student data.
create view Faculty as
select fp.PID, p.name, p.age, fp.Rank
from FacultyPerson fp
join FacultyOrStudent fos
on fos.PID = fp.PID
and fos.PersonType = fp.FacultyType
join Person p
on p.PID = fos.PID;
create view Students as
select sp.PID, p.name, p.age, sp.GPA
from StudentPerson sp
join FacultyOrStudent fos
on fos.PID = sp.PID
and fos.PersonType = sp.StudentType
join Person p
on p.PID = fos.PID;
This allows the apps to access the data in the form they most need. Triggers on the views also allow all DML operations in that same form. The apps don't need to know what actual form the underlying data is in. This gives the database developers the added convenience of being free to change the underlying data without worrying about the impact on the apps. Just change the views appropriately.
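As a rough illustration of a trigger on a view, the sketch below uses SQLite (driven from Python) because it supports INSTEAD OF triggers on views and runs without a server. That choice, the trigger name Faculty_Insert, the text column types and the sample row are all my assumptions; the trigger syntax in your own DBMS will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table Person( PID integer primary key, name text, age integer );
create table FacultyOrStudent(
    PID        integer not null references Person( PID ),
    PersonType text    not null check( PersonType in ( 'F', 'S' )),
    primary key( PID, PersonType )
);
create table FacultyPerson(
    PID         integer not null primary key,
    FacultyType text    not null check( FacultyType = 'F' ),
    Rank        text,
    foreign key( PID, FacultyType )
        references FacultyOrStudent( PID, PersonType )
);
create view Faculty as
select fp.PID as PID, p.name as name, p.age as age, fp.Rank as Rank
from FacultyPerson fp
join FacultyOrStudent fos
    on fos.PID = fp.PID and fos.PersonType = fp.FacultyType
join Person p
    on p.PID = fos.PID;

-- Hypothetical trigger: one insert on the view fans out to the base tables.
create trigger Faculty_Insert instead of insert on Faculty
begin
    insert into Person( PID, name, age ) values( new.PID, new.name, new.age );
    insert into FacultyOrStudent( PID, PersonType ) values( new.PID, 'F' );
    insert into FacultyPerson( PID, FacultyType, Rank ) values( new.PID, 'F', new.Rank );
end;
""")

# The app only ever talks to the view.
conn.execute(
    "insert into Faculty( PID, name, age, Rank ) values( ?, ?, ?, ? )",
    (7, "Grace", 45, "Professor"),
)
faculty_rows = conn.execute("select PID, name, age, Rank from Faculty").fetchall()
print(faculty_rows)
```

This version assumes the person does not exist yet. A real trigger would also have to handle someone who is already in Person, for example a student who joins the faculty.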
The names of the objects I use are for illustration only. Naming is per personal preference and/or corporate rules.
I also hardcoded the 'F' and 'S' values. Again, for illustration. These would much better be placed in their own lookup table with the field in FacultyOrStudent as a FK. This allows for scalability. To add other types of staff, such as Secretarial, Custodial or Maintenance, just add the definition(s) to the lookup table, each with a code not already in use ('S' is taken by students), and create the needed table(s) and view(s).
In short, do not transform. Keep the tables and add whatever other tables may be needed to maintain strict data integrity. Data integrity is your top priority in database design.
I am trying to model the following in a postgres db.
I have N 'datasets'. These datasets are things like survey results, national statistics, aggregated data etc. They each have a name, a source institution, a method etc. This is the metadata of a dataset, and I have tables created for this and tables for codifying the research methods etc. The 'root' metadata table is called 'Datasets'. Each row represents one dataset.
I then need to store and access the actual data associated with this dataset. So I need to create a table that contains that data. How do I represent the relationship between this table and its corresponding row in the 'Datasets' table?
An example:
'hea' is a set of survey responses. It is unaggregated, so each row is one survey response. I create a table called 'HeaData' that contains this data.
'cso' is a set of aggregated employment data. Each row is an economic sector. I create a table called 'CsoData' that contains this data.
I create a row for each of these in the 'Datasets' table with the relevant metadata for each, and they have ids of 1 and 2 respectively.
What is the best way to relate 1 to the HeaData table and 2 to the CsoData table?
I will eventually be accessing this data with scala slick so if the database design could just 'plug and play' with slick that would be ideal
Add a column to the Datasets table which designates which type of dataset it represents. Then a 1 may mean HEA and 2 may mean CSO. A check constraint would limit the field to one of the two values. If new types of datasets are added later, the only change needed is to change the constraint. If it is defined as a foreign key to a "type of dataset" table, you just need to add the new type of dataset there.
Form a unique index on the PK and the new field.
Add the same field to each of the subtables. But the check constraint limits the value in the HEA table to only that value and the CSO table to only that value. Then form the ID field of Datasets table and the new field as the FK to Datasets table.
This limits the ID value to only one of the subtables and it must be the one defined in the Datasets table. That is, if you define a HEA dataset entry with an ID value of 1000 and the HEA type value, the only subtable that can contain an ID value of 1000 is the HEA table.
create table Datasets(
ID int identity/auto_generate,
DSType char( 3 ) not null check( DSType in ( 'HEA', 'CSO' ) ),
[everything else],
constraint PK_Datasets primary key( ID ),
constraint UQ_Dataset_Type unique( ID, DSType ) -- needed for references
);
create table HEA(
ID int not null,
DSType char( 3 ) not null check( DSType = 'HEA' ), -- making this a constant value
[other HEA data],
constraint PK_HEA primary key( ID ),
constraint FK_HEA_Dataset_PK foreign key( ID )
references Datasets( ID ),
constraint FK_HEA_Dataset_Type foreign key( ID, DSType )
references Datasets( ID, DSType )
);
The same idea with the CSO subtable.
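Here is a minimal sketch of the whole arrangement, using the lookup-table variant instead of the hardcoded check. It runs against in-memory SQLite from Python purely so it can be tried without a server (the question targets Postgres). The DatasetTypes table, the Name column and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
create table DatasetTypes( DSType text primary key );  -- hypothetical lookup table
insert into DatasetTypes values( 'HEA' ), ( 'CSO' );

create table Datasets(
    ID     integer primary key,
    DSType text not null references DatasetTypes( DSType ),
    Name   text,
    unique( ID, DSType )  -- target of the subtable foreign keys
);
create table HEA(
    ID     integer primary key,
    DSType text not null default 'HEA' check( DSType = 'HEA' ),
    foreign key( ID, DSType ) references Datasets( ID, DSType )
);
create table CSO(
    ID     integer primary key,
    DSType text not null default 'CSO' check( DSType = 'CSO' ),
    foreign key( ID, DSType ) references Datasets( ID, DSType )
);
insert into Datasets( ID, DSType, Name ) values( 1, 'HEA', 'survey responses' );
insert into Datasets( ID, DSType, Name ) values( 2, 'CSO', 'employment data' );
""")

def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

right_table = attempt("insert into HEA( ID ) values( 1 )")  # dataset 1 is HEA
wrong_table = attempt("insert into CSO( ID ) values( 1 )")  # dataset 1 is not CSO
print(right_table, wrong_table)
```

Adding a third kind of dataset then means one new row in DatasetTypes plus its own subtable, with no change to the constraints on Datasets.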
I would recommend an HEA and CSO view that would show the complete dataset rows, metadata and type-specific data, joined together. With triggers on those views, they can be the DML points for the application code. Then the apps don't have to keep track of how that data is laid out in the database, making it a lot easier to make improvements should the opportunity present itself.
I am trying to figure out the best way to relate these tables together. Suppose I have the following tables:
tblPerson
tblGroup
tblResource
Each row in each of these tables can have multiple email addresses associated with them so I would want a separate table and relate it back.
Are there methods to have a single table (tblEmail) relate back to each of the tables? I thought of using a uniqueidentifier field in each of the parent tables and using that as a key in the email table. It would be guaranteed unique. I just wouldn't be able to create a FK in the email table to preserve integrity. That is manageable though.
Is there a fancy way to do this? I am creating these tables in SQL 2008 R2.
Thank you
Karl
While it may be tempting to try and use a single email table with a ParentType (Person/Group/Resource) and ParentID, this is dangerous and means you can't have the relationship defined in SQL (unless there's some feature I'm unaware of?).
If you want to have referential integrity in SQL you really need to create 3 tables, one for each parent table.
CREATE TABLE dbo.PersonEmail (
ID int IDENTITY PRIMARY KEY,
PersonID int,
EmailAddress varchar(500)
)
CREATE TABLE dbo.GroupEmail (
ID int IDENTITY PRIMARY KEY,
GroupID int,
EmailAddress varchar(500)
)
CREATE TABLE dbo.ResourceEmail (
ID int IDENTITY PRIMARY KEY,
ResourceID int,
EmailAddress varchar(500)
)
If you think you might extend your Email table to later include a DisplayName, and perhaps a BounceCount and others, create a table for Email and create many-to-many join tables to link them to Person/Group/Resource.
Be aware that edits might impact multiple links; you'll have to decide how you want to handle that.
This is a core part of SQL. In a proper relational design, you don't relate email addresses to persons, groups, or resources -- you relate the persons, groups, and resources TO the email.
So, with an email table of:
CREATE TABLE dbo.tblEmail (
emailID int IDENTITY PRIMARY KEY,
email varchar(500)
)
If you only need one email per entity, you would just add an emailID column to each of the other tables that model something that may need an email.
ALTER TABLE dbo.tblPerson
ADD emailID int REFERENCES dbo.tblEmail(emailID);
ALTER TABLE dbo.tblGroup
ADD emailID int REFERENCES dbo.tblEmail(emailID);
ALTER TABLE dbo.tblResource
ADD emailID int REFERENCES dbo.tblEmail(emailID);
If you need multiple email addresses per entity, you need to insert an additional table that maps each set of email addresses to its individual addresses. (I wouldn't do this unless you have a technical reason to handle the addresses individually, such as a bulk-email system where you want to avoid duplicates if someone uses the same email for their own use and their organization's use.)
CREATE TABLE dbo.tblEmail (
emailID int IDENTITY PRIMARY KEY
)
CREATE TABLE dbo.tblEmailAddress (
eAddrID int IDENTITY PRIMARY KEY,
eAddr varchar(500)
)
CREATE TABLE dbo.tblEmailSet (
emailID int REFERENCES dbo.tblEmail(emailID),
eAddrID int REFERENCES dbo.tblEmailAddress(eAddrID)
)
In order to, say, return a list of all emails to any Person, Group, or Resource named "Smith", you'd run the query below:
SELECT DISTINCT A.eAddr
FROM (
SELECT emailID FROM dbo.tblPerson WHERE Name = 'Smith'
UNION
SELECT emailID FROM dbo.tblGroup WHERE Name = 'Smith'
UNION
SELECT emailID FROM dbo.tblResource WHERE Name = 'Smith'
) AS PGR
INNER JOIN dbo.tblEmailSet AS S
ON PGR.emailID = S.emailID
INNER JOIN dbo.tblEmailAddress AS A
ON S.eAddrID = A.eAddrID
That ugly UNION, btw, is one of the reasons why you really don't want to do this unless you have a technical need to retrieve the data uniquely. While I've done this sort of many-to-many-to-many join on occasion, in this particular instance it's kind of a "code smell" and an indicator that instead of tracking "People", "Groups", and "Resources", you should be tracking "Contacts" with a "type" indicator to tell if a contact is a Person, a Group, or a Resource.
(Or maybe you never need to grab a bunch of email addresses, and just want a single table of emails you can check for whitelisting...)
So you want to have possibly multiple Emails per Person/Group/Resource, and you want all those emails in one table, am I correct ?
To do that, I would create a table dbo.EmailAddress such as this :
CREATE TABLE dbo.EmailAddress
(
EmailID BIGINT IDENTITY(1,1) NOT NULL PRIMARY KEY
,EmailAddress VARCHAR(250) NOT NULL
,CONSTRAINT UK_EmailAddress UNIQUE(EmailAddress) --to ensure that you never insert the same email address twice
)
Then I would create the relation between your Person/Group/Resource and your emails using another table:
CREATE TABLE dbo.EmailAddressParentXRef
(
EmailID BIGINT NOT NULL REFERENCES dbo.EmailAddress(EmailID) --must match the BIGINT type of the referenced column
,ParentTypeID INT NOT NULL
,PersonID INT NULL REFERENCES dbo.tblPerson(PersonID)
,GroupID INT NULL REFERENCES dbo.tblGroup(GroupID)
,ResourceID INT NULL REFERENCES dbo.tblResource(ResourceID)
,CONSTRAINT UK_EmailID_ParentTypeID UNIQUE(EmailID,ParentTypeID) --to make sure you don't put the same EmailID for the same type of Parent (e.g. EmailID=12 twice for a Person)
)
There you would have referential integrity + some checks to avoid duplicates when you load data. Note that I didn't put a check to make sure you actually fill in either the PersonID, GroupID or ResourceID. This can be added in different ways, but if you understand the principle of this table, you shouldn't load any line without those references (or they will just be useless).
A lot more checks can be added based on this, to take care of every type of duplication/error you might create when loading the data, but you get the point.
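One possible form of the missing check is sketched below: a CHECK constraint that counts the filled-in parent columns and insists on exactly one. I have written it against in-memory SQLite from Python so it can be run as-is; the CASE expressions themselves are plain SQL and should carry over to SQL Server. The constraint name CK_ExactlyOneParent and the sample data are made up, and ParentTypeID is left out to keep the sketch short.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table tblPerson( PersonID integer primary key );
create table tblGroup( GroupID integer primary key );
create table tblResource( ResourceID integer primary key );
create table EmailAddress(
    EmailID      integer primary key,
    EmailAddress text not null unique
);
create table EmailAddressParentXRef(
    EmailID    integer not null references EmailAddress( EmailID ),
    PersonID   integer references tblPerson( PersonID ),
    GroupID    integer references tblGroup( GroupID ),
    ResourceID integer references tblResource( ResourceID ),
    -- Hypothetical check: exactly one parent column may be filled in.
    constraint CK_ExactlyOneParent check(
        ( case when PersonID   is not null then 1 else 0 end )
      + ( case when GroupID    is not null then 1 else 0 end )
      + ( case when ResourceID is not null then 1 else 0 end ) = 1
    )
);
insert into tblPerson values( 1 );
insert into tblGroup values( 1 );
insert into EmailAddress values( 1, 'someone@example.com' );
""")

def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

one_parent = attempt(
    "insert into EmailAddressParentXRef( EmailID, PersonID ) values( 1, 1 )")
no_parent = attempt(
    "insert into EmailAddressParentXRef( EmailID ) values( 1 )")
two_parents = attempt(
    "insert into EmailAddressParentXRef( EmailID, PersonID, GroupID ) values( 1, 1, 1 )")
print(one_parent, no_parent, two_parents)
```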
I have separate assets tables for storing different kind of physical and logical assets, such as:-
Vehicle table( ID, model, EngineSize, Drivername, lastMaintenanceDate)
Server table ( ID, IP, OSName, etc…)
VM (ID, Size, etc…).
VM_IP (VM_ID,IP)
Now the problems I have is:-
For the IP column in the server table and in the VM_IP table, I need this column to be unique in these two tables, so for example the database should not allow a server and a VM to have the same IP. In the current design I can only guarantee uniqueness for the table separately.
So can anyone advise on how I can handle this uniqueness requirement at the database level?
Regards
::EDITED::
I have currently the following database structure:-
Currently I see these points:-
I have introduced a redundant AssetTypeID column in the base Asset table, so I can know the asset type without having to join tables. This might break normalization.
In my above architecture , I cannot control (on the database level) which asset should have IP, which asset should not have IP and which asset can/cannot have multiple IPs.
So is there a way to improve my architecture to handle these two points.
Thanks in advance for any help.
Create an IP table and use foreign keys
If I were facing the problem in design level, I would add two more tables:
A valid_IP table (containing valid IP range)
A Network_Enabeled, base table for all entities that may have an
IP, like Server table, VM_IP ,... the primary key of this base
table will be the primary key of child tables.
In the Network_Enabeled table, having a foreign key to the valid_IP table and setting a unique key on that field will be the answer.
Hope this is helpful.
You can use an indexed view.
CREATE VIEW YourViewName with SCHEMABINDING
as
...
GO
CREATE UNIQUE CLUSTERED INDEX IX_YourIndexName
on YourViewName (..., ...)
Based on your edit, you can introduce a superkey on the asset table and use various constraints to enforce most of what it sounds like you're looking for:
create table Asset (
AssetID int not null primary key,
AssetTypeID int not null
--Skip all of the rest, foreign keys, etc, irrelevant to example
,constraint UQ_Asset_TypeCheck
UNIQUE (AssetID,AssetTypeID) --This is the superkey
)
The above means that the AssetTypeID column can now be checked/enforced in other tables, and there's no risk of inconsistency
create table Servers (
AssetID int not null primary key,
AssetTypeID as 1 persisted,
constraint FK_Servers_Assets FOREIGN KEY (AssetID)
references Asset (AssetID), --Strictly, this will become redundant
constraint FK_Servers_Assets_TypeCheck FOREIGN KEY (AssetID,AssetTypeID)
references Asset (AssetID,AssetTypeID)
)
So, in the above, we enforce that all entries in this table must actually be of the correct asset type, by making it a fixed computed column that is then used in a foreign key back to the superkey.
--So on for other asset types
create table Asset_IP (
AssetID int not null,
IPAddress int not null primary key, --Wrong type, for IPv6
AssetTypeID int not null,
constraint FK_Asset_IP_Assets FOREIGN KEY (AssetID)
references Asset (AssetID), --Again, redundant
constraint CK_Asset_Types CHECK (
AssetTypeID in (1/*, Other types allowed IPs */)),
constraint FK_Asset_IP_Assets_TypeCheck FOREIGN KEY (AssetID,AssetTypeID)
references Asset (AssetID,AssetTypeID)
)
And now, above, we again reference the superkey to ensure that we've got a local (to this table) correct AssetTypeID value, which we can then use in a check constraint to limit which asset types are actually allowed entries in this table.
create unique index UQ_Asset_SingleIPs on Asset_IP (AssetID)
where AssetTypeID in (1/* Type IDs that are only allowed 1 IP address */)
And finally, for certain AssetTypeID values, we ensure that this table only contains one row for that AssetID.
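If you want to try the pieces together, here is a cut-down sketch using in-memory SQLite from Python, which also supports composite foreign keys and filtered (partial) unique indexes. The type IDs (1 = server, 2 = VM, 3 = vehicle), the text IP column and the sample rows are my assumptions, and the computed column and redundant single-column FKs are omitted for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table Asset(
    AssetID     integer primary key,
    AssetTypeID integer not null,
    unique( AssetID, AssetTypeID )  -- the superkey
);
create table Asset_IP(
    AssetID     integer not null,
    IPAddress   text    not null primary key,  -- an address has one owner
    AssetTypeID integer not null check( AssetTypeID in ( 1, 2 ) ),
    foreign key( AssetID, AssetTypeID ) references Asset( AssetID, AssetTypeID )
);
-- Type 1 assets may hold only a single address.
create unique index UQ_Asset_SingleIPs on Asset_IP( AssetID )
    where AssetTypeID = 1;

insert into Asset values( 10, 1 );  -- assumed: 1 = server
insert into Asset values( 20, 2 );  -- assumed: 2 = VM
insert into Asset values( 30, 3 );  -- assumed: 3 = vehicle, no IPs allowed
""")

def attempt(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"

results = {
    "server first IP":  attempt("insert into Asset_IP values( 10, '10.0.0.1', 1 )"),
    "server second IP": attempt("insert into Asset_IP values( 10, '10.0.0.2', 1 )"),
    "vm first IP":      attempt("insert into Asset_IP values( 20, '10.0.0.3', 2 )"),
    "vm second IP":     attempt("insert into Asset_IP values( 20, '10.0.0.4', 2 )"),
    "vm reuses IP":     attempt("insert into Asset_IP values( 20, '10.0.0.1', 2 )"),
    "vehicle IP":       attempt("insert into Asset_IP values( 30, '10.0.0.5', 3 )"),
}
for label, outcome in results.items():
    print(label, outcome)
```

The server's second address is stopped by the filtered index, the reused address by the primary key on IPAddress, and the vehicle by the check constraint.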
I hope that gives you enough ideas of how to implement your various checks based on types. If you want/need to, you can now construct some views (through which the rest of your code will interact) which hides the extra columns and provides triggers to ease INSERT statements.
On a side note, I'd recommend picking a convention and sticking to it when it comes to table naming. My preferred one is to use the plural/collective name, unless the table is only intended to contain one row. So I'd rename Asset as Assets, for example, or Asset_IP as Asset_IPs. At the moment, you have a mixture.
I am creating a page where people can post articles. When the user posts an article, it shows up on a list, like the related questions on Stack Overflow (when you add a new question). It's fairly simple.
My problem is that I have 2 types of users. 1) Unregistered private users. 2) A company.
The unregistered users need to type in their name, email and phone, whereas the company users just need to type in their company name/password. Fairly simple.
I need to reduce the excess database usage and try to optimize the database and build the tables effectively.
Now to my problem in hand:
So I have one table with the information about the companies, ID (guid), Name, email, phone etc.
I was thinking about making one table called articles that contained ArticleID, Headline, Content and Publishing date.
One table with the information about the unregistered users, ID, their name, email and phone.
How do I tie the articles table to the company/unregistered users table? Is it good to make an integer that contains 2 values, 1=Unregistered user and 2=Company, and then one field with an ID number pointing to the specified user/company? It looks like you need a lot of extra code to query the database. Performance? How could I then return the article along with the contact information? You should also be able to return all the articles from a specific company.
So Table company would be:
ID (guid), company name, phone, email, password, street, zip, country, state, www, description, contact person and a few more that i don't have here right now.
Table Unregistered user:
ID (guid), name, phone, email
Table article:
ID (int/guid/short guid), headline, content, published date, is_company, id_to_user
Is there a better approach?
The qualities I am looking for are: Performance, Easy to query and Easy to maintain (adding new fields, indexes etc)
Theory
The problem you described is called Table Inheritance in data modeling theory. In Martin Fowler's book the solutions are:
single table inheritance: a single table that contains all fields.
class table inheritance: one table per class, with table for abstract classes.
concrete table inheritance: one table per non-abstract class, abstract members are repeated in each concrete table
So from a theory and industry practice point of view all three solutions are acceptable: one table Posters with NULLable columns (ie. single table), three tables Posters, Companies and Persons (ie. class inheritance) and two tables Companies and Persons (ie. concrete inheritance).
Now, to pros and cons.
Cost of NULL columns
The record structure is discussed in Inside the Storage Engine: Anatomy of a record:
NULL bitmap
two bytes for count of columns in the record
variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000 which had one bit per nullable column only)
So if you have at least one NULLable column, you pay the cost of the NULL bitmap in each record, at least 3 bytes. But the cost is identical whether you have 1 or 8 columns! The 9th column will add a byte to the NULL bitmap in each record. The formula is described in Estimating the Size of a Clustered Index: 2 + ((Num_Cols + 7) / 8), where only the integer part of the result is used.
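The arithmetic is easy to check with a few column counts. This is just the formula above written as a small Python function (the function name is mine), using integer division:

```python
def null_bitmap_bytes(num_cols: int) -> int:
    """Bytes used by the NULL bitmap: 2 + ((Num_Cols + 7) / 8), integer part only."""
    return 2 + (num_cols + 7) // 8

# Column counts on either side of the 8-column byte boundaries.
sizes = {n: null_bitmap_bytes(n) for n in (1, 8, 9, 16, 17)}
print(sizes)
```

One to eight columns cost 3 bytes, nine to sixteen cost 4, and so on: one extra byte per eight columns.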
Performance Driving Factor
In a database system there is really only one factor that drives performance: the amount of data scanned. How large are the records scanned by a query plan, and how many records does it have to scan? So to improve the performance you need to:
narrow the records: reduce the data size, covering include indexes, vertical partitioning
reduce the number of records scanned: indexes
reduce the number of scans: eliminate joins
Now in order to analyze these criteria, there is something missing in your post: the prevalent data access pattern, ie. the most common query that the database will be hit with. This is driven by how you display your posts on the site. Consider these possible approaches:
posts front page: like SO, a page of recent posts with header, excerpt, time posted and author basic information (name, gravatar). To get this page displayed you need to join Posts with authors, but you only need the author name and gravatar. Both single table inheritance and class table inheritance would work, but concrete table inheritance would fail. This is because you cannot afford for such a query to do conditional joins (ie. join the articles posted to either Companies or Persons), such a query will be less than optimal.
posts per author: users have to login first and then they'll see their own posts (this is common for non-public post oriented sites, think incident tracking for instance). For such a design, all three table inheritance schemes would work.
Conclusion
There are some general performance considerations (ie. narrow the data) to consider, but the critical information is missing: how are you going to query the data, your access pattern. The data model has to be optimized for that access pattern:
Which fields from Companies and Persons will be displayed on the landing page of the site (ie. the most often and performance critical query) ? You don't want to join 5 tables to show those fields.
Are some Company/Person information fields only needed on the user information page? Perhaps partition the table vertically into CompaniesExtra and PersonsExtra tables. Or use an index that will cover the frequently used fields (this approach simplifies code and is easier to keep consistent, at the cost of data duplication)
PS
Needless to say, don't use guids for ids. Unless you're building a distributed system, they are a horrible choice for reasons of excessive width. Fragmentation is also a potential problem, but that can be alleviated by use of sequential guids.
Ideally if you could use ORM (as mentioned by TFD), I would do so. Since you have not commented on that as well as you always come back with the "performance" question, I assume you would not like to use one.
Using pure SQL, the approach I would suggest would be to have table structure as below:
ActicleOwner [ID (guid)]
Company [ID (guid) - PK as well as FK to ActicleOwner.ID,
company name, phone, email, password, street, zip, ...]
UnregisteredUser [ID (guid) - PK as well as FK to ActicleOwner.ID,
name, phone, email]
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ActicleOwner.ID]
Let's see the usages:
INSERT: overhead is the need to add a row to ActicleOwner table for each Company/UU. This is not the operation that happens so often, there is no need to optimize performance
SELECT:
Company/UU: well, it is easy to search for both UU and Company, since you do not need to JOIN to any other table, as all the info about the required object is in one table
Acticles of one Company/UU: again, you just need to filter on the GUID of the Company/UU, and there you go: SELECT (list fields) FROM Acticle WHERE ArticleOwnerID = @AOID
Also consider that one day you might need to support multiple Owners per Article. With the parent table approach above (or the one mentioned by Vincent) you will just need to introduce a relation table, whereas with two NULL-able FK columns, one to each Owner table, you are kind of stuck.
Performance:
Are you sure you have performance problem? What is your target?
One thing I can recommend, looking at your model with regard to performance, is not to use GUIDs as the clustered index (which is the default for a PK), because your INSERT statements will then be inserting data at random positions in the table.
Alternatives are:
use Sequential GUID instead (see: What are the performance improvement of Sequential Guid over standard Guid?)
use both INTEGER and GUID. This is a somewhat complicated approach and might be overkill for the simple model you have, but the result is that you always JOIN tables in SELECTs on INTEGER instead of GUID, which is much faster.
So if you are so hot on performance, you might try to do the following:
ActicleOwner (ID (int identity) - PK, UID (guid) - UC)
Company [ID (int) - PK as well as FK to ActicleOwner.ID,
UID (guid) - UC as well as FK to ActicleOwner.UID, company name, ...]
...
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ActicleOwner.ID (int)]
To INSERT a user (Company/UU) you do the following:
Having a UID (maybe sequential one) from the code, you do INSERT into ActicleOwner table. You get back the autogenerated integer ID.
you insert all the data into Company/UU, including the integer ID that you have just received.
ActicleOwner.ID will be an integer, so searching on it will be faster than on UID, especially when you have an index on it.
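The two-step insert can be sketched as follows. I am using in-memory SQLite from Python here only to have something runnable; in SQL Server you would get the generated ID back with SCOPE_IDENTITY() or an OUTPUT clause instead of lastrowid. The helper insert_company, the Name column and the sample company are hypothetical, and uuid4 produces a random GUID where a sequential one could be substituted.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table ActicleOwner(
    ID  integer primary key autoincrement,
    UID text not null unique
);
create table Company(
    ID   integer primary key references ActicleOwner( ID ),
    UID  text not null unique references ActicleOwner( UID ),
    Name text
);
""")

def insert_company(name):
    """Insert the owner row first, then the company row keyed by the generated ID."""
    uid = str(uuid.uuid4())  # generated in application code
    conn.execute("begin")
    try:
        cur = conn.execute("insert into ActicleOwner( UID ) values( ? )", (uid,))
        owner_id = cur.lastrowid  # the autogenerated integer ID
        conn.execute(
            "insert into Company( ID, UID, Name ) values( ?, ?, ? )",
            (owner_id, uid, name),
        )
        conn.execute("commit")
    except sqlite3.Error:
        conn.execute("rollback")
        raise
    return owner_id, uid

owner_id, uid = insert_company("somecorp")
company_row = conn.execute("select ID, UID, Name from Company").fetchone()
print(owner_id, company_row[2])
```

Wrapping both inserts in one transaction keeps an owner row from being left behind if the second insert fails.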
This is a common OO programming problem that should not be solved in the SQL domain. It should be handled by your ORM.
Make two classes in your program code as required and let your ORM map them to a suitable SQL representation. For performance, a single table with nulls will do; the only overhead is the discriminator column.
For some examples, see Hibernate's inheritance mapping.
I would suggest the super-type Author for Person and Organization sub-types.
Note that AuthorID serves as the primary and the foreign key at the same time for Person and Organization tables.
So first let's create tables:
CREATE TABLE Author(
AuthorID integer IDENTITY NOT NULL
,AuthorType char(1)
,Phone varchar(20)
,Email varchar(128) NOT NULL
);
ALTER TABLE Author ADD CONSTRAINT pk_Author PRIMARY KEY (AuthorID);
CREATE TABLE Article (
ArticleID integer IDENTITY NOT NULL
,AuthorID integer NOT NULL
,DatePublished date
,Headline varchar(100)
,Content varchar(max)
);
ALTER TABLE Article ADD
CONSTRAINT pk_Article PRIMARY KEY (ArticleID)
,CONSTRAINT fk1_Article FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID) ;
CREATE TABLE Person (
AuthorID integer NOT NULL
,FirstName varchar(50)
,LastName varchar(50)
);
ALTER TABLE Person ADD
CONSTRAINT pk_Person PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Person FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
CREATE TABLE Organization (
AuthorID integer NOT NULL
,OrgName varchar(40)
,OrgPassword varchar(128)
,OrgCountry varchar(40)
,OrgState varchar(40)
,OrgZIP varchar(16)
,OrgContactName varchar(100)
);
ALTER TABLE Organization ADD
CONSTRAINT pk_Organization PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Organization FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
When inserting into Author you have to capture the auto-incremented id and then use it to insert the rest of data into person or organization, depending on AuthorType. Each row in Author has only one matching row in Person or Organization, not in both. Here is an example of how to capture the AuthorID.
-- Insert into table and return the auto-incremented AuthorID
INSERT INTO Author ( AuthorType, Phone, Email )
OUTPUT INSERTED.AuthorID
VALUES ( 'P', '789-789-7899', 'dudete@mmahoo.com' );
Here are a few examples of how to query authors:
-- Return all authors (org and person)
SELECT *
FROM dbo.Author AS a
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all organization-authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all person-authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID ;
And now all articles with authors.
-- Return all articles with author information
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
There are two ways to return all articles belonging to organizations. The first example returns only columns from the Organization table, while the second one has columns from the Person table too, with NULL values.
-- (1) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;
-- (2) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE AuthorType = 'O';
And to return all articles belonging to a specific organization, again two methods.
-- (1) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
-- (2) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
To make queries simpler, you could package some of this into a view or two.
Just as a reminder, it is common for an article to have several authors, so a many-to-many table Article_Author would be in order.
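A minimal sketch of such an Article_Author table, again in in-memory SQLite from Python so that it runs as-is. The tables are cut down to a couple of columns, the AuthorID column on Article is dropped in favour of the link table, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
create table Author( AuthorID integer primary key, Email text not null );
create table Article( ArticleID integer primary key, Headline text );
-- One row per (article, author) pair; the composite key blocks duplicate links.
create table Article_Author(
    ArticleID integer not null references Article( ArticleID ),
    AuthorID  integer not null references Author( AuthorID ),
    primary key( ArticleID, AuthorID )
);
insert into Author values( 1, 'first@example.com' ), ( 2, 'second@example.com' );
insert into Article values( 100, 'Shared byline' );
insert into Article_Author values( 100, 1 ), ( 100, 2 );
""")

# All authors of one article.
coauthors = [
    row[0]
    for row in conn.execute(
        """
        select a.Email
        from Article_Author aa
        join Author a on a.AuthorID = aa.AuthorID
        where aa.ArticleID = ?
        order by a.AuthorID
        """,
        (100,),
    )
]
print(coauthors)
```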
My preference is to use a table that acts like a super table to both.
ArticleOwner = (ID (guid), company name, phone, email)
company = (ID, password)
unregistereduser = (ID)
article = (ID (int/guid/short guid), headline, content, published date, owner)
Then querying the database will require a JOIN on the 3 tables but this way you do not have the null fields.
I'd suggest instead of two tables create one table Poster.
It's ok to have some fields empty if they are not applicable to one kind of poster.
Poster:
ID (guid), type, name, phone, email, password
where type is 1 for company, 2 - for unregistered user.
OR
Keep your users and companies separate, but require each company to have a user in users table. That table should have a CompanyID field. I think it would be more logical and elegant.
An interesting approach would be to use the Node model followed by Drupal, where everything is effectively a Node and all other data is stored in a secondary table. It's highly flexible, as is evidenced by the widespread use of Drupal in large publishing and discussion sites.
The layout would be something like this:
Node
ID
Type (User, Guest, Article)
TypeID (PKey of related data)
Created
Modified
Article
ID
Field1
Field2
Etc.
User
ID
Field1
Field2
Etc.
Guest
ID
Field1
Field2
Etc.
It's an alternative option with some good benefits, the greatest being flexibility.
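As an illustration of the layout above (not Drupal's actual schema), the Node table might be declared like this. The NodeType column name, the data types, and the unique constraint are assumptions, and the query assumes an Article table with an ID key as in the outline.

```sql
CREATE TABLE dbo.Node
(
    ID       int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    NodeType varchar(10)       NOT NULL
             CHECK (NodeType IN ('User', 'Guest', 'Article')),
    TypeID   int               NOT NULL,   -- key of the row in the table named by NodeType
    Created  datetime          NOT NULL DEFAULT (GETDATE()),
    Modified datetime          NULL,
    CONSTRAINT UK_Node_Type_TypeID UNIQUE (NodeType, TypeID)
);

-- Resolving a node means joining to the table its type points at.
SELECT  n.ID AS NodeID, n.Created, a.*
FROM    dbo.Node    AS n
JOIN    dbo.Article AS a ON a.ID = n.TypeID
WHERE   n.NodeType = 'Article';
```

The price of the flexibility is that TypeID points at a different table depending on NodeType, so it cannot be declared as an ordinary foreign key.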
I'm not convinced you need to distinguish between companies and persons; only registered and unregistered authors.
I added this for clarity. You could simply use a check constraint on the Authors table to limit the values to U and R.
Create Table dbo.AuthorRegisteredStates
(
Code char(1) not null Primary Key Clustered
, Name nvarchar(15) not null
, Constraint UK_AuthorRegisteredState Unique ( [Name])
)
Insert dbo.AuthorRegisteredStates(Code, Name) Values('U', 'Unregistered')
Insert dbo.AuthorRegisteredStates(Code, Name) Values('R', 'Registered')
GO
The key in any database system is data integrity. So, we want to ensure that usernames are unique and, perhaps, that Names are unique. Do you want to allow two people with the same name to publish an article? How would the reader differentiate them? Notice that I don't care whether the Author represents a company or person. If someone is registering a company or a person, they can put in a first name and last name if they want. However, what is required is that everyone enter a name (think of it as a display name). We would never search for authors based on anything other than name.
Create Table dbo.Authors
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorStateCode char(1) not null
, Name nvarchar(100) not null
, Email nvarchar(300) null
, Username nvarchar(20) not null
, PasswordHash nvarchar(50) not null
, FirstName nvarchar(25) null
, LastName nvarchar(25) null
...
, Address nvarchar(max) null
, City nvarchar(40) null
...
, Website nvarchar(max) null
, Constraint UK_Authors_Name Unique ( [Name] )
, Constraint UK_Authors_Username Unique ( [Username] )
, Constraint FK_Authors_AuthorRegisteredStates
Foreign Key ( AuthorStateCode )
References dbo.AuthorRegisteredStates ( Code )
-- optional. if you really wanted to ensure that an author that was unregistered
-- had a firstname and lastname. However, I'd recommend enforcing this in the GUI
-- if anywhere as it really does not matter if they
-- enter a first name and last name.
-- All that matters is whether they are registered and entered a name.
, Constraint CK_Authors_RegisteredWithFirstNameLastName
Check ( AuthorStateCode = 'R' Or ( AuthorStateCode = 'U' And FirstName Is Not Null And LastName Is Not Null ) )
)
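If you prefer the simpler route mentioned earlier, a check constraint on the Authors table in place of the lookup table, it could look something like this. The constraint name is made up for the example, and you would drop FK_Authors_AuthorRegisteredStates from the definition above.

```sql
-- Limits the state code to the two known values without a lookup table.
Alter Table dbo.Authors
    Add Constraint CK_Authors_AuthorStateCode
        Check ( AuthorStateCode In ( 'U', 'R' ) )
```

The lookup table still has the advantage that a new state is just an inserted row, whereas the check constraint has to be dropped and recreated.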
Can a single author publish two articles on the same date and time? If not (as I've guessed here), then we add a unique constraint. The question is whether you might need to identify an article. What information might you be given to locate an article besides the general date it was published?
Create Table dbo.Articles
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorId int not null
, PublishedDate datetime not null
, Headline nvarchar(200) not null
, Content nvarchar(max) null
...
, Constraint UK_Articles_PublishedDate Unique ( AuthorId, PublishedDate )
, Constraint FK_Articles_Authors
Foreign Key ( AuthorId )
References dbo.Authors ( Id )
)
In addition, I would add an index on PublishedDate to improve searches by date.
Create Index IX_Articles_PublishedDate On dbo.Articles ( PublishedDate )
I would also enable full-text search to search on the contents of articles.
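In SQL Server this is the Full-Text Search feature. A minimal sketch, assuming the feature is installed, might be the following. The catalog and index names are invented, and the extra unique index is there because the primary key above was declared without a name to reference.

```sql
-- A full-text index needs a unique, single-column, non-nullable key index.
Create Unique Index UX_Articles_Id On dbo.Articles ( Id );
GO

-- Hypothetical catalog name.
Create Fulltext Catalog ArticlesCatalog;
GO

Create Fulltext Index On dbo.Articles ( Content )
    Key Index UX_Articles_Id
    On ArticlesCatalog;
GO

-- Find articles whose content mentions a word.
Select Id, Headline
From dbo.Articles
Where Contains ( Content, 'database' );
```

The index is populated in the background, so a query run immediately after creating it may not see every row yet.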
I think concerns about "empty space" are probably premature optimization. The effect on performance will be nil. This is a case where a small amount of denormalizing costs you nothing in terms of performance and gains you in terms of development. However, if it really concerned you, you could move the address information into a 1:1 table like so:
Create Table dbo.AuthorAddresses
(
AuthorId int not null Primary Key Clustered
, Street nvarchar(max) not null
, City nvarchar(40) not null
...
, Constraint FK_AuthorAddresses_Authors
Foreign Key ( AuthorId )
References dbo.Authors( Id )
)
This will add a small amount of complexity to your middle-tier. As always, the question is whether the benefit of eliminating some empty space outweighs the cost in coding and testing. Whether you store this information as columns in your Authors table or in a separate table, the effect on performance will be nil.
I have solved similar problems by an approach similar to this:
Company -> CompanyArticles
User -> UserArticles
Articles
CompanyArticles contains a mapping from Company to an Article
UserArticles contains a mapping from User to Article
Article doesn't know anything about who created it.
By inverting the dependencies here you end up not overloading the meaning of foreign keys, having unused foreign keys, or creating a super table.
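The two mapping tables might be declared roughly as follows. The sketch assumes that user, company and articles already exist with integer keys named user_id, company_id and article_id; the data types are guesses.

```sql
-- Note: USER is a reserved word in several engines (SQL Server and PostgreSQL,
-- for example), where the table name would have to be quoted or renamed.
CREATE TABLE userarticles
(
    user_id    int NOT NULL REFERENCES user (user_id),
    article_id int NOT NULL REFERENCES articles (article_id),
    PRIMARY KEY (user_id, article_id)
);

CREATE TABLE companyarticles
(
    company_id int NOT NULL REFERENCES company (company_id),
    article_id int NOT NULL REFERENCES articles (article_id),
    PRIMARY KEY (company_id, article_id)
);
```

One consequence of this design is that nothing in the schema forces an article to have an owner, or stops it from having both a user and a company; if that matters, it has to be checked elsewhere.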
Getting all articles and contact information would look like:
SELECT name, phone, email FROM
user
JOIN userarticles on user.user_id = userarticles.user_id
JOIN articles on userarticles.article_id = articles.article_id
UNION
SELECT name, phone, email FROM
company
JOIN companyarticles on company.company_id = companyarticles.company_id
JOIN articles on companyarticles.article_id = articles.article_id