Storing detailed data in SQL Server - sql-server

I’m designing a database in which I save votes.
I’ve created a table:
CREATE TABLE [dbo].[users_votes](
[id] [bigint] NOT NULL,
[like_votes] [int] NOT NULL DEFAULT ((0)),
[dislike_votes] [int] NOT NULL DEFAULT ((0)),
[commented_votes] [int] NOT NULL DEFAULT ((0)),
[comments_likes] [int] NOT NULL DEFAULT ((0))
);
The issue is that there is a requirement to also store the breakdown data by location.
So for example if the users_votes table has 1,000 like_votes for a specific id, I need to know the breakdown by location, e.g.:
United States 340
France 155
Denmark 25
Brazil 290
Australia 190
I’m getting the data from the client as comma delimited String, for example:
(1,2,45,67,87,112,234) and the country code for location (us, au, ca, etc...).
I’ve been thinking about a few possibilities for storing this data, but wanted to know which of these approaches is best suited (if any).
As the number of country codes is finite, I can expand users_votes table and add columns with country codes for each criteria. E.g. like_votes_us, dislike_votes_us, comment_votes_us, comment_likes_us.
In this case I will probably use Dynamic SQL to insert/update the data.
Create a new table for each counter column. For example, a table named like_votes containing an id, an external_id referencing the users_votes table's id, a country_code, and a count column. The data would then be stored in users_votes and also in the like_votes table, with one record for each combination of external_id and country code.
In this case I will need to iterate over the inserted data to determine whether each external_id/country combination already exists (and then just increment it) or needs to be inserted.
Which approach, if any, is the optimal way to store this data so it will be easy to insert/update and also to query?

This type of table design you have at the moment isn't a good idea, in all honesty. One important part of building a good relational database is using Normal Form. I'm not going to explain what that is here, as there are tens of thousands of articles on the internet explaining it and its different iterations (from 1NF to 6NF, if I recall correctly).
Anyway, you can easily do this with a few tables. I'm having to guess a lot of your setup here, but hopefully you'll be able to extrapolate what you need and adjust what doesn't fit.
Firstly, let's start with a client table:
CREATE TABLE dbo.Client (ClientID int IDENTITY(1,1),
ClientName varchar(100), --You should really split this into Title, Forename and Surname, I'm just being "lazy" here
ClientCountryID int, --Not sure if a Client is related to a country or the vote is, i've guessed the client is.
DOB date,
EmailAddress varchar(100));
GO
So, we have a simple Client Table now. Next, we want a Country Table. This is very simple:
CREATE TABLE dbo.Country (CountryID int IDENTITY(1,1),
CountryName varchar(100),
CountryCode char(2)); --For example UK for United Kingdom, FR for France, etc
GO
You might want to store additional content there, but I don't know your set up.
Now, this is where I'm really guessing a lot. I'm assuming that your likes and dislikes, etc, are linked to something. What, I have no idea, so, I'm going to have a table called "Content", however, not knowing what these likes are against, I have no context for this table, thus it's going to be very basic:
CREATE TABLE dbo.Content (ContentID int IDENTITY(1,1),
ContentType int, --Guessing there might be types, maybe videos, comments, articles? I have no idea, to be honest
ContentParent int, --Comments are joined to a Content (just like here on SO)? I'll guess it's possible
Content nvarchar(MAX)); --because I have no idea what's going in there
--Very simple Content Type Table
CREATE TABLE dbo.ContentType (TypeID int IDENTITY(1,1),
TypeDescription varchar(100));
GO
Now, finally, we can get onto the votes that you want to store; which might look something like this:
CREATE TABLE dbo.Vote (VoteID int IDENTITY(1,1),
ClientID int,
ContentID int,
Liked bit); --1 for Liked, 0 for Disliked, NULL for N/A perhaps?
GO
OK, now we have some tables. I realise I haven't given any kind of sample data to go in here, so I'll provide a few INSERT statements for you, so you can get the idea:
INSERT INTO dbo.Country (CountryName, CountryCode)
VALUES ('United Kingdom','GB'),
('France','FR'),
('Germany','DE');
GO
INSERT INTO dbo.Client (ClientName, ClientCountryID, DOB, EmailAddress)
VALUES ('Mr John Smith',1, '19880106','Bob@gmial.com'),
('Ms Penelope Vert',2,'19930509','PVert@mfn.com');
GO
INSERT INTO dbo.ContentType (TypeDescription)
VALUES ('Video'),('Article'),('Comment');
GO
INSERT INTO dbo.Content (ContentType, ContentParent, Content)
VALUES (2, NULL, 'This is my first article, hi everyone!'),
(3, 1, 'Nice! Good to see you''re finally posting!'),
(1, NULL, 'http://youtube.com');
GO
--And now some votes:
INSERT INTO dbo.Vote (ClientID, ContentID, Liked)
VALUES (1, 1, 1),
(2, 1, 1),
(2, 2, 1),
(2, 3, 0);
GO
Notice how I've put the votes in. I've not aggregated them in the table; doing so is an awful idea. Instead, store each vote individually and use a query to aggregate. You can easily do this, for example:
SELECT C.ContentID,
Cy.CountryName,
COUNT(CASE V.Liked WHEN 1 THEN 1 END) AS LikedVotes,
COUNT(CASE V.Liked WHEN 0 THEN 1 END) AS DisLikedVotes
FROM dbo.Content C
JOIN dbo.Vote V ON C.ContentID = V.ContentID
JOIN dbo.Client CV ON V.ClientID = CV.ClientID
JOIN dbo.Country Cy ON CV.ClientCountryID = Cy.CountryID
GROUP BY C.ContentID,
Cy.CountryName;
This gives you the number of Liked Votes per Content Item, and splits it into Countries as well for you. If you want to put these countries into their own columns, then I strongly suggest doing this in your presentation layer, not your SQL (as you'll have to use Dynamic SQL, and (no offence) I imagine this is beyond your skills at the moment based on your current database design choices). Excel is very good at doing this using Pivot tables. If you want to retain the process in SQL Server, consider using SSRS and a matrix.
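If you do eventually want the countries as columns in SQL for a known, fixed list, a static PIVOT works; the sketch below only covers the three sample countries inserted above (anything beyond a hard-coded list is exactly where Dynamic SQL comes in):
-- Static sketch only: the country columns are hard-coded
SELECT ContentID, [United Kingdom], [France], [Germany]
FROM (SELECT C.ContentID,
             Cy.CountryName,
             CASE V.Liked WHEN 1 THEN 1 END AS LikedVote
      FROM dbo.Content C
      JOIN dbo.Vote V ON C.ContentID = V.ContentID
      JOIN dbo.Client CV ON V.ClientID = CV.ClientID
      JOIN dbo.Country Cy ON CV.ClientCountryID = Cy.CountryID) AS src
PIVOT (COUNT(LikedVote) FOR CountryName IN ([United Kingdom], [France], [Germany])) AS p;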
If you have any questions, please do ask.
Note: I have not added any kind of foreign keys, constraints, default values, etc. here. These are a definite must for any good database design.
Clean Up script:
DROP TABLE dbo.Client;
DROP TABLE dbo.Country;
DROP TABLE dbo.Vote;
DROP TABLE dbo.Content;
DROP TABLE dbo.ContentType;
GO

Related

How can I insert rows of one table into multiple tables using a SQL Server stored procedure?

I am interested in inserting my rows of tempDataTable into two tables.
The table design of my tempDataTable is shown in the image in the original post.
These are the two tables I want to create via the stored procedure from my TempDataTable.
The design for the two new tables would be something like:
Table one (Product): ProductID (PK), ProductName, Product URL
Table two (ProductPricing): ProductPricingID(PK), ProductId (FK), price, priceperunit, Date
I've spent a full day searching for a solution, and will keep at it, but I haven't been able to find an exact answer. I am not experienced with SQL, but this is something I have to do.
Okay, I'm not sure exactly where you are struggling, so here's a script that sort of does what you asked for. None of this is too hard to follow, so maybe have a scan through it, and then let me know which bits are confusing?
Set up the table structure:
CREATE TABLE tempDataTable (
TempProductId INT,
TempProductUrl VARCHAR(512),
TempProductPrice VARCHAR(50),
TempProductPricePerUnit VARCHAR(50),
TempProductName VARCHAR(512));
INSERT INTO tempDataTable SELECT 2491, 'https://yadayada1', '£1.65/unit', '46p/100g', 'Yeo Valley Little Yeos, blah';
INSERT INTO tempDataTable SELECT 2492, 'https://yadayada2', '60p/unit', '1p/ea', 'Sainsbury''s Little Ones, etc';
CREATE TABLE Product (
ProductId INT PRIMARY KEY,
ProductName VARCHAR(512),
ProductUrl VARCHAR(512));
CREATE TABLE ProductPricing (
ProductPricingId INT IDENTITY(1,1) PRIMARY KEY,
ProductId INT,
ProductPrice VARCHAR(50),
ProductPricePerUnit VARCHAR(50),
ProductPricingDate DATETIME);
ALTER TABLE ProductPricing ADD CONSTRAINT foreignkey$ProductPricing$Product FOREIGN KEY (ProductId) REFERENCES Product (ProductId);
This gives me three tables to play with, one with some temporary data in it, and two that you want to push the data into, with a couple of primary keys, and a foreign key constraint to ensure integrity between the two tables.
Good so far?
Now to split the data between the two tables is as simple as:
INSERT INTO Product (ProductId, ProductName, ProductUrl) SELECT TempProductId, TempProductName, TempProductUrl FROM tempDataTable;
INSERT INTO ProductPricing (ProductId, ProductPrice, ProductPricePerUnit, ProductPricingDate) SELECT TempProductId, TempProductPrice, TempProductPricePerUnit, GETDATE() FROM tempDataTable;
If you run that then you should end up with data in your two tables, like this:
Product
ProductId ProductName ProductUrl
2491 Yeo Valley Little Yeos, blah https://yadayada1
2492 Sainsbury's Little Ones, etc https://yadayada2
ProductPricing
ProductPricingId ProductId ProductPrice ProductPricePerUnit ProductPricingDate
1 2491 £1.65/unit 46p/100g 2020-04-27 14:29:14.657
2 2492 60p/unit 1p/ea 2020-04-27 14:29:14.657
Now there's a whole load of questions that arise from this:
how are you going to cope with running this more than once, because the second time you run it there will be primary key violations?
do you want to clear down the temporary data somehow on successful completion?
do you want to use the system date as the pricing date, or are there more columns off the edge of your image?
do you want to check the data for duplicates and deal with them before running the script, or it will just fail?
if you do get a duplicate then do you skip it, or update the data (MERGE)?
why do you want this as a stored procedure? I mean it's easy enough to make into one, but I don't see why this would need to be repeatable... without seeing the other "moving parts" in this system anyway.
I'm guessing that you are loading bulk data into that temporary table somehow, from an Excel workbook, or XML, or similar. So all you want is a way to "tear the data up" into multiple tables. If this is indeed the case, then using a tool like SSIS might be more practical?
Okay, so that's 90% there, but you need two other things:
cope with situations where the product id already exists - don't try to insert it a second time as it will fail;
where the product id already exists then update the price data.
This should handle the first tweak:
INSERT INTO Product (ProductId, ProductName, ProductUrl) SELECT t.TempProductId, t.TempProductName, t.TempProductUrl FROM tempDataTable t
WHERE NOT EXISTS (SELECT * FROM Product p WHERE p.ProductId = t.TempProductId);
...and to UPDATE prices where the data already exists, or INSERT them if they don't exist, well you can use a MERGE statement:
MERGE
ProductPricing AS [target]
USING (SELECT TempProductId, TempProductPrice, TempProductPricePerUnit, GETDATE() AS ProductPricingDate FROM tempDataTable)
AS [source] (
ProductId,
ProductPrice,
ProductPricePerUnit,
ProductPricingDate)
ON ([target].ProductId = [source].ProductId)
WHEN MATCHED THEN
UPDATE SET
ProductPrice = [source].ProductPrice,
ProductPricePerUnit = [source].ProductPricePerUnit,
ProductPricingDate = [source].ProductPricingDate
WHEN NOT MATCHED THEN
INSERT (
ProductId,
ProductPrice,
ProductPricePerUnit,
ProductPricingDate)
VALUES (
[source].ProductId,
[source].ProductPrice,
[source].ProductPricePerUnit,
[source].ProductPricingDate);
Actually, re-reading your comment, I don't think you even need a MERGE (but I'm going to leave it there anyway, as it took me a little effort to write it).
I think your second case is as simple as just letting the second INSERT always run. There are two scenarios:
if there's already an entry for that product - then just add a new row to the ProductPricing table, so you will have one product, and two (or more) prices, each with a different date;
if it's a new product - then add the product and the price, so you will have one product and one price (until a new price arrives).
...and I can't resist adding, this is because you are using a natural key, i.e. a key from your data, so it doesn't change as you load it. If you were using a surrogate key (e.g. an IDENTITY that you got when you inserted the Product) then this wouldn't work, you would need to go and look up the surrogate key, then use this so your foreign key constraint worked properly. It's probably best to not think about this too hard?
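For illustration only, here is roughly what that lookup would have to look like with a hypothetical surrogate key (none of these columns exist in the tables above): suppose Product had ProductKey INT IDENTITY as its primary key, ProductId held the natural key from the feed, and ProductPricing referenced ProductKey.
-- Hypothetical surrogate-key variant, not the schema used above
INSERT INTO ProductPricing (ProductKey, ProductPrice, ProductPricePerUnit, ProductPricingDate)
SELECT p.ProductKey, t.TempProductPrice, t.TempProductPricePerUnit, GETDATE()
FROM tempDataTable t
JOIN Product p ON p.ProductId = t.TempProductId; -- look up the surrogate via the natural key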

How to automatically create and link a record in table B after insert in table A

Consider this scenario: I have a table Person which has a link to table Address as a one-to-many relationship (omitting constraints here for brevity):
CREATE TABLE Person (
Id UNIQUEIDENTIFIER NOT NULL,
AddressId UNIQUEIDENTIFIER NULL
)
CREATE TABLE Address (
Id UNIQUEIDENTIFIER NOT NULL,
Street NVARCHAR(100) NULL,
City NVARCHAR(100) NULL,
ZipCode NVARCHAR(20) NULL
)
Now, if a record is inserted into the Person table, I would like to automatically create an empty record in the Address table (if the AddressId column is NULL) and link that new Address to the new Person record. So in other words, I want to create a record in the Address table and want to update Person.AddressId for every inserted Person without an Address.
Since I am accessing the database from different applications using different ORMs and different business classes, I do not want to add that functionality in the business classes (multiple times), but rather in a DB trigger on the Person table.
Is it good practice to do that in a DB trigger?
What is the best implementation for the trigger ("best" meaning a good tradeoff between performance and readability)?
I could implement the trigger using a WHILE loop iterating over all records in inserted and then adding a record for each of the inserted Persons - if they have no Address assigned yet. However, this does not feel like the right approach if lots of Person records are created in a bulk operation (like a big import for example). Is it possible to perform this in one SQL statement with better performance?
You may try using an after insert trigger:
CREATE TRIGGER personInsTrigger ON Person AFTER INSERT
AS
BEGIN
-- Create an Address for each inserted Person without one, then link it back
DECLARE @new TABLE (PersonId UNIQUEIDENTIFIER, AddressId UNIQUEIDENTIFIER);
INSERT INTO @new (PersonId, AddressId)
SELECT Id, NEWID() FROM INSERTED WHERE AddressId IS NULL;
INSERT INTO Address (Id, Street, City, ZipCode)
SELECT AddressId, NULL, NULL, NULL FROM @new;
UPDATE p SET p.AddressId = n.AddressId
FROM Person p JOIN @new n ON n.PersonId = p.Id;
END
I would also recommend that you consider setting up proper foreign/primary key constraints between the two tables. Address.Id would be a primary key, and Person.AddressId would be a foreign key.
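A sketch of those constraints (the constraint names are my own):
ALTER TABLE Address ADD CONSTRAINT PK_Address PRIMARY KEY (Id);
ALTER TABLE Person ADD CONSTRAINT PK_Person PRIMARY KEY (Id);
ALTER TABLE Person ADD CONSTRAINT FK_Person_Address
    FOREIGN KEY (AddressId) REFERENCES Address (Id);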

Setting up a relational DB for event logging

I am a bit rusty with my SQL since I have not worked with it beyond basic querying of existing databases that were already setup.
I am trying to create an event logging database, and want to take an "extreme" approach to normalization. I would have a main table comprised mostly of 'smallint' fields that point to child tables which contain strings.
Example:
I have an external system that I would like to enable some logging for via SQL. The user fills in some key parameters, which build an insert/update statement that gets pushed to the logging tables, so the values can be viewed later if anyone needs to know what XYZ was at runtime, or at some point in the past.
I have a main table which consists of:
SELECT [log_id] - bigint (auto-increment) PK
,[date_time] - smalldatetime
,[cust_id] - smallint FK
,[recloc] - char(8)
,[alert_level] - smallint FK
,[header] - varchar(100)
,[body] - varchar(1000)
,[process_id] - smallint FK
,[routine_id] - smallint FK
,[workflow_id] - smallint FK
FROM [EventLogs].[dbo].[eventLogs]
All of the 'smallint' fields point to a child table which contains the expanded data:
Example:
SELECT [routine_id] PK/FK
,[routine_name]
,[description]
FROM [EventLogs].[dbo].[cpRoutine]
SELECT [process_id] PK/FK
,[process_name]
,[description]
FROM [EventLogs].[dbo].[cpProcess]
My goal here, is to have the external system do an update/insert statement that reaches all these tables. I have all the 'smallint' fields linked up as FK's currently.
How do I go about crafting the update/insert statements that touch all these tables? If a child table already contains a key-value pair, I do not want to touch it. The idea of the child tables is to house repetitive data and assign it a key in the main logging table to keep the size down. Do I need to check for the existence of records in the child tables, save the index number, and then build my insert statement for the main table? I'm trying to be as efficient as possible here.
Example:
I want to log the following from the external system:
- date_time - GETDATE()
- customer_number - '0123456789'
- recloc - 'ABC123'
- alert_level - 'info'
- header - 'this is a header'
- body - 'this is a body'
- process_name - 'the process'
- routine_name - 'the routine'
- workflow_name - 'the workflow'
Do I need to create my insert statement for the main table (eventLogs) but check each child table first and add missing values, then save the id for my insert statement in the main table?
Select process_id, process_name From cpProcess where process_name = 'the process'
If no values returned, do an insert statement with the process_name
Now query the table again to get the ID so I can build the "main insert statement" that feeds the master log table
Repeat for all other child tables
final insert statement looks something like:
SQL code:
INSERT INTO eventLogs (date_time, cust_id, recloc, alert_level, header, body, process_id, routine_id, workflow_id)
VALUES('2017-12-31', '1', 'ABC123', '3', 'this is a header', 'this is a body', '13', '19', '12')
It just seems like I am doing too much back and forth with the server, checking for values in the child tables just to do my insert...
The end goal here is to create a friendly view that pulls in all the data assigned to the 'smallint' keys.
You're close:
Select process_id from cpProcess where process_name = 'the process'
If no values are returned, do an insert statement with the process_name and get the ID through IDENT_CURRENT, SCOPE_IDENTITY, or @@IDENTITY (or use a subordinate "load" procedure and get the ID from an output parameter).
Repeat for each child table until you get the values required to do your final insert into [eventLogs].
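A sketch of that pattern for one child table, assuming process_id is an IDENTITY column (the variable name is mine):
DECLARE @process_id smallint;
SELECT @process_id = process_id FROM cpProcess WHERE process_name = 'the process';
IF @process_id IS NULL
BEGIN
    INSERT INTO cpProcess (process_name, [description]) VALUES ('the process', NULL);
    SET @process_id = SCOPE_IDENTITY();
END;
-- repeat for cpRoutine, the workflow table, etc., then use the captured ids
-- in the final INSERT into eventLogs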
This works fine if it is a relatively low speed process. As you increase the speed you can have issues, but if you are doing INSERT only, as you should, it still isn't terrible. I've used SQL Server Service Broker in the past to decouple processes such as these to improve performance, but that obviously adds complexity.
Depending on the load you might also decide to build aggregate tables in a fact/dimension star so that the INSERT OLTP process is segregated from the SELECT OLAP process.
What you're seeing is the complexity involved in building a normalized data structure. Your approach "to take an 'extreme' approach to normalization" is often bypassed because it's "too hard". That doesn't mean you shouldn't do it, but you should weigh the ROI. In the past, where there were only ever going to be perhaps less than ten thousand records at any given time, I have made the decision to just dump everything into a log table like the one below. You just have to look at the requirements and make the best choice.
CREATE TABLE [log].[data]
(
[id] INT IDENTITY(1, 1)
, [timestamp] DATETIME DEFAULT sysdatetime()
, [entry] XML NOT NULL
);
One option that I frequently use during the build out phase of a design is to build placeholders behind adapters as shown below. Use the getter and setter methods ALWAYS and later, when you need better performance or data storage, you can refactor the underlying data structure as required, modify the adapters to the new data structures, and you've saved yourself some time. Otherwise you can end up chasing a lot of rabbits down holes early in the project. Often you'll find that your design for the underlying structures changes based on requirements as the project moves forward and you'd have spent a lot of time on changes. Using this approach you get a working mechanism in place immediately.
Later on if you need to collapse this structure to provide better performance it will be trivial compared to constantly changing the structure during design (in my opinion).
Oh, and yes, you could use a standard relational table. I use a lot of XML in applications and event logging because it allows ad hoc structured data. The concept is the same. You could use your top level table, just with the [process_name], etc. columns directly in the table and no child columns for now.
Just remember you should NOT allow access to the underlying tables directly! One way to prevent this is to actually put them in a dedicated schema such as [log_secure], and secure that schema to all but admin and the accessor/mutator methods.
IF schema_id(N'log') IS NULL
EXECUTE (N'CREATE SCHEMA log');
go
IF object_id(N'[log].[data]', N'U') IS NOT NULL
DROP TABLE [log].[data];
go
CREATE TABLE [log].[data]
(
[id] BIGINT IDENTITY(1, 1)
, [timestamp] DATETIMEOFFSET NOT NULL -- DATETIME if timezone isn't needed
CONSTRAINT [log__data__timestamp__df] DEFAULT sysdatetimeoffset()
, [entry] XML NOT NULL,
CONSTRAINT [log__data__id__pk] PRIMARY KEY CLUSTERED ([id])
);
IF object_id(N'[log].[get_entry]', N'P') IS NOT NULL
DROP PROCEDURE [log].[get_entry];
go
CREATE PROCEDURE [log].[get_entry] @id BIGINT
, @entry XML output
, @begin DATETIMEOFFSET
, @end DATETIMEOFFSET
AS
BEGIN
SELECT @entry = [entry]
FROM [log].[data]
WHERE [id] = @id;
END;
go
IF object_id(N'[log].[set_entry]', N'P') IS NOT NULL
DROP PROCEDURE [log].[set_entry];
go
CREATE PROCEDURE [log].[set_entry] @entry XML
, @timestamp DATETIMEOFFSET = NULL
, @id BIGINT output
AS
BEGIN
INSERT INTO [log].[data]
([timestamp]
, [entry])
VALUES ( COALESCE(@timestamp, sysdatetimeoffset()), @entry );
SET @id = SCOPE_IDENTITY();
END;
go
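To actually lock the schema down as described above, something along these lines would do it; the role name is just a placeholder:
-- The application role can only call the procedures, not touch the tables directly
DENY SELECT, INSERT, UPDATE, DELETE ON SCHEMA::[log] TO [app_role];
GRANT EXECUTE ON SCHEMA::[log] TO [app_role];
GO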

Database Performance and maintenance with one thousand columns.

I need to create a table with one thousand fields (columns) and I don't know how to handle the performance or how to maintain it. Please help me with suggestions.
If most times most values are NULL then you should upgrade to SQL Server 2008 and use sparse columns, see Using Sparse Columns and Using Column Sets.
If your column values are not mostly NULL then I question the soundness of your data model.
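For reference, a minimal sparse-column sketch (the table and column names here are made up):
CREATE TABLE dbo.WideEntity (
    Id int IDENTITY(1,1) PRIMARY KEY,
    Attr001 int SPARSE NULL,            -- sparse columns take no storage when NULL
    Attr002 varchar(50) SPARSE NULL,
    AllAttributes xml COLUMN_SET FOR ALL_SPARSE_COLUMNS -- optional column set over all sparse columns
);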
First things you will have to do:
Normalize. Define the entities and separate them out into different tables. Draw an ER diagram and you will get more ideas.
Don't go much beyond 15 columns per table if the columns are varchar or text, because then SQL Server will have to store the data in different pages. If the columns are boolean, it can be around 30.
Define the clustered index properly based on your data, as this will optimize querying.
Since the question doesn't give much detail, the answers are also generic and from a 100-foot view.
Again, please don't do this.
Check out the Entity Attribute Value model with respect to databases. It will help you store a large amount of sparse attributes on an entity and doesn't make databases cry.
The basic concept is shown below
create table #attributes
(
id int identity(1,1),
attribute varchar(20),
attribute_description varchar(max),
attribute_type varchar(20)
)
insert into #attributes values ('Column 1','what you want to put in column 1 of 1000','string')
insert into #attributes values ('Column 2','what you want to put in column 2 of 1000','int')
create table #entity
(
id int identity(1,1),
whatever varchar(max)
)
insert into #entity values ('Entity1')
insert into #entity values ('Entity2')
create table #entity_attribute
(
id int identity(1,1),
entity_id int,
attribute_id int,
attribute_value varchar(max)
)
insert into #entity_attribute values (1,1,'e1value1')
insert into #entity_attribute values (1,2,'e1value2')
insert into #entity_attribute values (2,2,'e2value2')
select *
from #entity e
join #entity_attribute ea on e.id = ea.entity_id
The difference between what goes in the #entity table and what goes in the #attribute table is somewhat dependent on the application but a general rule would be something that is never null and is accessed every time you need the entity, I would limit this to 10 or so items.
Let me guess this is a medical application?

How to store the following SQL data optimally in SQL Server 2008

I am creating a page where people can post articles. When the user posts an article, it shows up on a list, like the related questions on Stack Overflow (when you add a new question). It's fairly simple.
My problem is that I have 2 types of users. 1) Unregistered private users. 2) A company.
The unregistered users needs to type in their name, email and phone. Whereas the company users just needs to type in their company name/password. Fairly simple.
I need to reduce the excess database usage and try to optimize the database and build the tables effectively.
Now to my problem in hand:
So I have one table with the information about the companies, ID (guid), Name, email, phone etc.
I was thinking about making one table called articles that contained ArticleID, Headline, Content and Publishing date.
One table with the information about the unregistered users, ID, their name, email and phone.
How do I tie the articles table to the company/unregistered users table? Is it a good idea to have an integer field with 2 values (1 = unregistered user, 2 = company) plus a field holding the ID of the specified user/company? It looks like that needs a lot of extra code to query the database. What about performance? How could I then return an article along with its contact information? It should also be possible to return all the articles from a specific company.
So Table company would be:
ID (guid), company name, phone, email, password, street, zip, country, state, www, description, contact person and a few more that i don't have here right now.
Table Unregistered user:
ID (guid), name, phone, email
Table article:
ID (int/guid/short guid), headline, content, published date, is_company, id_to_user
Is there a better approach?
Qualities that I am looking for is: Performance, Easy to query and Easy to maintain (adding new fields, indexes etc)
Theory
The problem you described is called Table Inheritance in data modeling theory. In Martin Fowler's book the solutions are:
single table inheritance: a single table that contains all fields.
class table inheritance: one table per class, with table for abstract classes.
concrete table inheritance: one table per non-abstract class, abstract members are repeated in each concrete table
So from a theory and industry practice point of view all three solutions are acceptable: one table Posters with NULLable columns (i.e. single table), three tables Posters, Companies and Persons (i.e. class inheritance), or two tables Companies and Persons (i.e. concrete inheritance).
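As a rough sketch of the single table variant (the columns are guesses based on the post), it would look something like:
-- Single table inheritance: one Posters table with a type discriminator and
-- NULLable columns that only apply to one subtype
CREATE TABLE dbo.Posters (
    PosterID int IDENTITY(1,1) PRIMARY KEY,
    PosterType char(1) NOT NULL,    -- 'P' person, 'C' company
    Email varchar(128) NOT NULL,
    Phone varchar(20) NULL,
    PersonName varchar(100) NULL,   -- person-only
    CompanyName varchar(100) NULL,  -- company-only
    PasswordHash varchar(128) NULL  -- company-only
);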
Now, to pros and cons.
Cost of NULL columns
The record structure is discussed in Inside the Storage Engine: Anatomy of a record:
NULL bitmap:
two bytes for the count of columns in the record
a variable number of bytes to store one bit per column in the record, regardless of whether the column is nullable or not (this is different and simpler than SQL Server 2000, which had one bit per nullable column only)
So if you have at least one NULLable column, you pay the cost of the NULL bitmap in each record: at least 3 bytes. But the cost is identical whether you have 1 or 8 columns! The 9th column adds a byte to the NULL bitmap in each record. The formula is described in Estimating the Size of a Clustered Index: 2 + ((Num_Cols + 7) / 8). For example, 8 columns give 2 + (15 / 8) = 3 bytes (integer division), while 9 columns give 2 + (16 / 8) = 4 bytes.
Performance Driving Factor
In a database system there is really only one factor that drives performance: the amount of data scanned. How large are the records scanned by a query plan, and how many records does it have to scan? So to improve the performance you need to:
narrow the records: reduce the data size, covering include indexes, vertical partitioning
reduce the number of records scanned: indexes
reduce the number of scans: eliminate joins
Now in order to analyze these criteria, there is something missing in your post: the prevalent data access pattern, ie. the most common query that the database will be hit with. This is driven by how you display your posts on the site. Consider these possible approaches:
posts front page: like SO, a page of recent posts with header, excerpt, time posted and basic author information (name, gravatar). To get this page displayed you need to join Posts with authors, but you only need the author name and gravatar. Both single table inheritance and class table inheritance would work, but concrete table inheritance would fail. This is because you cannot afford for such a query to do conditional joins (i.e. join the articles posted to either Companies or Persons); such a query will be less than optimal.
posts per author: users have to login first and then they'll see their own posts (this is common for non-public post oriented sites, think incident tracking for instance). For such a design, all three table inheritance schemes would work.
Conclusion
There are some general performance considerations (ie. narrow the data) to consider, but the critical information is missing: how are you going to query the data, your access pattern. The data model has to be optimized for that access pattern:
Which fields from Companies and Persons will be displayed on the landing page of the site (ie. the most often and performance critical query) ? You don't want to join 5 tables to show those fields.
Are some Company/Person information fields only needed on the user information page? Perhaps partition the table vertically into CompaniesExtra and PersonsExtra tables. Or use an index that will cover the frequently used fields (this approach simplifies code and is easier to keep consistent, at the cost of data duplication); see the sketch below.
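A sketch of such a covering index (the index, table and column names are assumptions): the front-page query can be answered from the index alone because the displayed columns are carried in the INCLUDE list.
CREATE INDEX IX_Persons_FrontPage ON dbo.Persons (PersonID) INCLUDE (Name, Gravatar);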
PS
Needless to say, don't use guids for ids. Unless you're building a distributed system, they are a horrible choice for reasons of excessive width. Fragmentation is also a potential problem, but that can be alleviated by use of sequential guids.
Ideally, if you could use an ORM (as mentioned by TFD), I would do so. Since you have not commented on that, and you keep coming back to the "performance" question, I assume you would rather not use one.
Using pure SQL, the approach I would suggest would be to have table structure as below:
ArticleOwner [ID (guid)]
Company [ID (guid) - PK as well as FK to ArticleOwner.ID,
company name, phone, email, password, street, zip, ...]
UnregisteredUser [ID (guid) - PK as well as FK to ArticleOwner.ID,
name, phone, email]
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ArticleOwner.ID]
Let's look at usage:
INSERT: the overhead is the need to add a row to the ArticleOwner table for each Company/UU. This is not an operation that happens very often, so there is no need to optimize its performance.
SELECT:
Company/UU: it is easy to search for both UU and Company, since you do not need to JOIN to any other table; all the info about the required object is in one table
Articles of one Company/UU: again, you just need to filter on the GUID of the Company/UU, and there you go: SELECT (list fields) FROM Article WHERE ArticleOwnerID = @AOID
Also consider that one day you might need to support multiple Owners per Article. With the parent-table approach above (or the one mentioned by Vincent) you just need to introduce a relation table, whereas with two NULL-able FK constraints, one to each Owner table, you are more or less stuck.
Performance:
Are you sure you have a performance problem? What is your target?
One thing I can recommend when looking at your model regarding performance is not to use GUIDs as the clustered index (which is the default for a PK), because your INSERT statements will basically be inserting data at random positions in the table.
Alternatives are:
use Sequential GUID instead (see: What are the performance improvement of Sequential Guid over standard Guid?)
use both INTEGER and GUID. This is a somewhat more complicated approach and might be overkill for a model as simple as yours, but the result is that you always JOIN tables in SELECTs on an INTEGER instead of a GUID, which is much faster.
So if you are so hot on performance, you might try to do the following:
ArticleOwner (ID (int identity) - PK, UID (guid) - UC)
Company [ID (int) - PK as well as FK to ArticleOwner.ID,
UID (guid) - UC as well as FK to ArticleOwner.UID, company name, ...]
...
Article = [ID (int/guid/short guid), headline, content, published date,
ArticleOwnerID - FK to ArticleOwner.ID (int)]
To INSERT a user (Company/UU) you do the following:
Having a UID (maybe a sequential one) generated in code, you INSERT into the ArticleOwner table and get back the autogenerated integer ID.
You insert all the data into Company/UU, including the integer ID that you have just received.
ArticleOwner.ID will be an integer, so searching on it will be faster than on UID, especially when you have an index on it.
This is a common OO programming problem that should not be solved in the SQL domain. It should be handled by your ORM.
Make two classes in your program code as required and let your ORM map them to a suitable SQL representation. For performance, a single table with NULLs will do; the only overhead is the discriminator column.
Some examples: Hibernate inheritance.
I would suggest the super-type Author for Person and Organization sub-types.
Note that AuthorID serves as the primary and the foreign key at the same time for Person and Organization tables.
So first let's create tables:
CREATE TABLE Author(
AuthorID integer IDENTITY NOT NULL
,AuthorType char(1)
,Phone varchar(20)
,Email varchar(128) NOT NULL
);
ALTER TABLE Author ADD CONSTRAINT pk_Author PRIMARY KEY (AuthorID);
CREATE TABLE Article (
ArticleID integer IDENTITY NOT NULL
,AuthorID integer NOT NULL
,DatePublished date
,Headline varchar(100)
,Content varchar(max)
);
ALTER TABLE Article ADD
CONSTRAINT pk_Article PRIMARY KEY (ArticleID)
,CONSTRAINT fk1_Article FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID) ;
CREATE TABLE Person (
AuthorID integer NOT NULL
,FirstName varchar(50)
,LastName varchar(50)
);
ALTER TABLE Person ADD
CONSTRAINT pk_Person PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Person FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
CREATE TABLE Organization (
AuthorID integer NOT NULL
,OrgName varchar(40)
,OrgPassword varchar(128)
,OrgCountry varchar(40)
,OrgState varchar(40)
,OrgZIP varchar(16)
,OrgContactName varchar(100)
);
ALTER TABLE Organization ADD
CONSTRAINT pk_Organization PRIMARY KEY (AuthorID)
,CONSTRAINT fk1_Organization FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);
When inserting into Author you have to capture the auto-incremented id and then use it to insert the rest of data into person or organization, depending on AuthorType. Each row in Author has only one matching row in Person or Organization, not in both. Here is an example of how to capture the AuthorID.
-- Insert into table and return the auto-incremented AuthorID
INSERT INTO Author ( AuthorType, Phone, Email )
OUTPUT INSERTED.AuthorID
VALUES ( 'P', '789-789-7899', 'dudete@mmahoo.com' );
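To then use the captured id for the subtype row, one option (a sketch; the OUTPUT ... INTO table variable and the sample names are mine) is:
DECLARE @newAuthor TABLE (AuthorID integer);
INSERT INTO Author ( AuthorType, Phone, Email )
OUTPUT INSERTED.AuthorID INTO @newAuthor
VALUES ( 'P', '789-789-7899', 'dudete@mmahoo.com' );
INSERT INTO Person ( AuthorID, FirstName, LastName )
SELECT AuthorID, 'John', 'Smith' FROM @newAuthor;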
Here are a few examples of how to query authors:
-- Return all authors (org and person)
SELECT *
FROM dbo.Author AS a
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all-organization authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
-- Return all person-authors
SELECT *
FROM dbo.Author AS a
JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
And now all articles with authors.
-- Return all articles with author information
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID ;
There are two ways to return all articles belonging to organizations. The first example returns only columns from the Organization table, while the second one has columns from the Person table too, with NULL values.
-- (1) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;
-- (2) Return all articles belonging to organizations
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE AuthorType = 'O';
And to return all articles belonging to a specific organization, again two methods.
-- (1) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
-- (2) Return all articles belonging to a specific organization
SELECT *
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID
WHERE c.OrgName = 'somecorp';
To make queries simpler, you could package some of this into a view or two.
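For example, one such convenience view (the view name is mine) simply wraps the joins used above:
CREATE VIEW dbo.ArticleAuthors AS
SELECT x.ArticleID, x.Headline, x.DatePublished,
       a.AuthorID, a.AuthorType, a.Email,
       p.FirstName, p.LastName,
       c.OrgName
FROM dbo.Article AS x
JOIN dbo.Author AS a ON a.AuthorID = x.AuthorID
LEFT JOIN dbo.Person AS p ON a.AuthorID = p.AuthorID
LEFT JOIN dbo.Organization AS c ON c.AuthorID = a.AuthorID;
GO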
Just as a reminder, it is common for an article to have several authors, so a many-to-many table Article_Author would be in order.
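If you do go that way, the link table follows the same pattern as the rest of the schema (a sketch, with constraint names in the same style):
CREATE TABLE Article_Author (
ArticleID integer NOT NULL
,AuthorID integer NOT NULL
);
ALTER TABLE Article_Author ADD
CONSTRAINT pk_Article_Author PRIMARY KEY (ArticleID, AuthorID)
,CONSTRAINT fk1_Article_Author FOREIGN KEY (ArticleID) REFERENCES Article(ArticleID)
,CONSTRAINT fk2_Article_Author FOREIGN KEY (AuthorID) REFERENCES Author(AuthorID);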
My preference is to use a table that acts like a super table to both.
ArticleOwner = (ID (guid), company name, phone, email)
company = (ID, password)
unregistereduser = (ID)
article = (ID (int/guid/short guid), headline, content, published date, owner)
Then querying the database will require a JOIN on the 3 tables but this way you do not have the null fields.
I'd suggest instead of two tables create one table Poster.
It's ok to have some fields empty if they are not applicable to one kind of poster.
Poster:
ID (guid), type, name, phone, email, password
where type is 1 for company, 2 - for unregistered user.
OR
Keep your users and companies separate, but require each company to have a user in users table. That table should have a CompanyID field. I think it would be more logical and elegant.
An interesting approach would be to use the Node model followed by Drupal, where everything is effectively a Node and all other data is stored in a secondary table. It's highly flexible and as is evidenced by the widespread use of Drupal in large publishing and discussion sites.
The layout would be something like this:
Node
ID
Type (User, Guest, Article)
TypeID (PKey of related data)
Created
Modified
Article
ID
Field1
Field2
Etc.
User
ID
Field1
Field2
Etc.
Guest
ID
Field1
Field2
Etc.
It's an alternative option with some good benefits. The greatest being flexibility.
I'm not convinced you need to distinguish between companies and persons; only registered and unregistered authors.
I added this for clarity. You could simply use a check constraint on the Authors table to limit the values to U and R.
Create Table dbo.AuthorRegisteredStates
(
Code char(1) not null Primary Key Clustered
, Name nvarchar(15) not null
, Constraint UK_AuthorRegisteredState Unique ( [Name])
)
Insert dbo.AuthorRegisteredStates(Code, Name) Values('U', 'Unregistered')
Insert dbo.AuthorRegisteredStates(Code, Name) Values('R', 'Registered')
GO
The key in any database system is data integrity. So, we want to ensure that usernames are unique and, perhaps, that Names are unique. Do you want to allow two people with the same name to publish an article? How would the reader differentiate them? Notice that I don't care whether the Author represents a company or person. If someone is registering a company or a person, they can put in a first name and last name if they want. However, what is required is that everyone enter a name (think of it as a display name). We would never search for authors based on anything other than name.
Create Table dbo.Authors
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorStateCode char(1) not null
, Name nvarchar(100) not null
, Email nvarchar(300) null
, Username nvarchar(20) not null
, PasswordHash nvarchar(50) not null
, FirstName nvarchar(25) null
, LastName nvarchar(25) null
...
, Address nvarchar(max) null
, City nvarchar(40) null
...
, Website nvarchar(max) null
, Constraint UK_Authors_Name Unique ( [Name] )
, Constraint UK_Authors_Username Unique ( [Username] )
, Constraint FK_Authors_AuthorRegisteredStates
Foreign Key ( AuthorStateCode )
References dbo.AuthorRegisteredStates ( Code )
-- optional. if you really wanted to ensure that an author that was unregistered
-- had a firstname and lastname. However, I'd recommend enforcing this in the GUI
-- if anywhere as it really does not matter if they
-- enter a first name and last name.
-- All that matters is whether they are registered and entered a name.
, Constraint CK_Authors_RegisteredWithFirstNameLastName
Check ( AuthorStateCode = 'R' Or ( AuthorStateCode = 'U' And FirstName Is Not Null And LastName Is Not Null ) )
)
Can a single author publish two articles on the same date and time? If not (as I've guessed here), then we add a unique constraint. The question is whether you might need to identify an article. What information might you be given to locate an article besides the general date it was published?
Create Table dbo.Articles
(
Id int not null identity(1,1) Primary Key Clustered
, AuthorId int not null
, PublishedDate datetime not null
, Headline nvarchar(200) not null
, Content nvarchar(max) null
...
, Constraint UK_Articles_PublishedDate Unique ( AuthorId, PublishedDate )
, Constraint FK_Articles_Authors
Foreign Key ( AuthorId )
References dbo.Authors ( Id )
)
In addition, I would add an index on PublishedDate to improve searches by date.
Create Index IX_Articles_PublishedDate On dbo.Articles ( PublishedDate )
I would also enable free text search to search on the contents of articles.
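A sketch of enabling full-text search on the Content column; it needs a full-text catalog and must be keyed on a unique index, so this assumes the primary key above was explicitly named PK_Articles (with the inline declaration, SQL Server generates a name you would have to look up):
CREATE FULLTEXT CATALOG ArticlesFullText AS DEFAULT;
CREATE FULLTEXT INDEX ON dbo.Articles ( Content )
KEY INDEX PK_Articles;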
I think concerns about "empty space" are probably premature optimization. The effect on performance will be nil. This is a case where a small amount of denormalizing costs you nothing in terms of performance and gains you in terms of development. However, if it really concerned you, you could move the address information into 1:1 table like so:
Create Table dbo.AuthorAddresses
(
AuthorId int not null Primary Key Clustered
, Street nvarchar(max) not null
, City nvarchar(40) not null
...
, Constraint FK_AuthorAddresses_Authors
Foreign Key ( AuthorId )
References dbo.Authors( Id )
)
This will add a small amount of complexity to your middle-tier. As always, the question is whether the elimination of some empty space exceeds the cost in terms of coding and testing. Whether you store this information as columns in your Authors table or in a separate table, the effect on performance will be nil.
I have solved similar problems by an approach similar to this:
Company -> CompanyArticles
User -> UserArticles
Articles
CompanyArticles contains a mapping from Company to an Article
UserArticles contains a mapping from User to Article
Article doesn't know anything about who created it.
By inverting the dependencies here you end up not overloading the meaning of foreign keys, having unused foreign keys, or creating a super table.
Getting all articles and contact information would look like:
SELECT name, phone, email FROM
user
JOIN userarticles on user.user_id = userarticles.user_id
JOIN articles on userarticles.article_id = articles.article_id
UNION
SELECT name, phone, email FROM
company
JOIN companyarticles on company.company_id = companyarticles.company_id
JOIN articles on companyarticles.article_id = articles.article_id
