Should I normalize this database design further? - sql-server

I have the following database design:
TABLE [Document]
[DocumentId] [int] NOT NULL, --Primary Key
[Status] [bit] NULL,
[Text] [nvarchar](max) NULL,
[FolderPath] [nvarchar](max) NULL
TABLE [Metadata]
[MetadataId] [int] IDENTITY(1,1) NOT NULL, -- Primary Key
[DocumentId] [int] NOT NULL, -- Foreign Key Document.DocumentId (1:1 relationship)
[Title] [nvarchar](250) NOT NULL,
[Author] [nvarchar](250) NOT NULL
TABLE [Page](
[PageId] [int] IDENTITY(1,1) NOT NULL, -- Primary Key
[DocumentId] [int] NOT NULL, -- Foreign Key Document.DocumentId (1:N Relationship)
[Number] [int] NOT NULL,
[ImagePath] [nvarchar](max) NULL,
[PageText] [nvarchar](max) NOT NULL
TABLE [Word](
[WordId] [int] IDENTITY(1,1) NOT NULL, -- Primary Key
[PageId] [int] NOT NULL, -- Foreign Key Page.PageId (1:N Relationship)
[Text] [nvarchar](50) NOT NULL
TABLE [Keyword](
[KeywordId] [int] IDENTITY(1,1) NOT NULL, -- Primary Key
[Word] [nvarchar](50) NOT NULL
TABLE [DocumentKeyword](
[Document_DocumentId] [int] NOT NULL, -- Foreign Key Document.DocumentId (N:N Relationship)
[Keyword_KeywordId] [int] NOT NULL -- Foreign Key Keyword.KeywordId
I'm using Entity Framework Code First to create the database.
Should I be normalizing my database design further, i.e. creating link tables between Document and Page, Document and Metadata, etc.? If so, is there a way to get Entity Framework to create the relationship tables for me, so that I don't have to include them in my models? I'm trying to learn to do this the right and most efficient way.
Thank you.

Well, I can't immediately answer your question, but I have some thoughts that might improve your design:
1. A document (in real life, at least) can be written by more than one author. This means that your 1:1 relationship from Document to Metadata should be a 1:n relationship (unless you can prove that there will never be more than one author).
2. The title of a document is (in my view) more a property of the document than a piece of metadata (also with point 1 in mind).
3. What does the Word table do?
4. The column Keyword_KeywordId should simply be called KeywordId if you want to be consistent in your naming. The same applies to Document_DocumentId.
Other than that, it looks pretty normalized.
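As a rough sketch of the points about multiple authors and column naming, assuming SQL Server and that [Document] already has its primary key on [DocumentId], the tables might look like the following. The [DocumentAuthor] table and all constraint names are hypothetical and not part of the original design.

```sql
-- Hypothetical 1:n table: one row per (document, author) pair.
CREATE TABLE [dbo].[DocumentAuthor]
(
    [DocumentAuthorId] INT IDENTITY(1,1) NOT NULL,
    [DocumentId]       INT NOT NULL,
    [Author]           NVARCHAR(250) NOT NULL,
    CONSTRAINT [PK_DocumentAuthor] PRIMARY KEY ([DocumentAuthorId]),
    CONSTRAINT [FK_DocumentAuthor_Document]
        FOREIGN KEY ([DocumentId]) REFERENCES [dbo].[Document] ([DocumentId]),
    -- Prevents listing the same author twice for one document.
    CONSTRAINT [UQ_DocumentAuthor] UNIQUE ([DocumentId], [Author])
);

-- N:N link table using the plainer column names suggested above.
CREATE TABLE [dbo].[DocumentKeyword]
(
    [DocumentId] INT NOT NULL,
    [KeywordId]  INT NOT NULL,
    CONSTRAINT [PK_DocumentKeyword] PRIMARY KEY ([DocumentId], [KeywordId]),
    CONSTRAINT [FK_DocumentKeyword_Document]
        FOREIGN KEY ([DocumentId]) REFERENCES [dbo].[Document] ([DocumentId]),
    CONSTRAINT [FK_DocumentKeyword_Keyword]
        FOREIGN KEY ([KeywordId]) REFERENCES [dbo].[Keyword] ([KeywordId])
);
```

The composite primary key on the link table keeps the same keyword from being attached to a document twice.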

Related

Best performance design with time-series in sql-server

(TL;DR)
The problem to solve with the design:
fast retrieval of related time-series with different frequency.
The tool:
A sql server table and index design.
The longer version:
I wish to calculate different functions at one or more specific times or intervals, with input data from time-series with different resolutions. My intuition tells me that I need to think extra carefully about the table/index design, given that the objective is a fast join of the rows.
The design advice I have seen so far is mostly concerned with retrieving a single time-series, whereas the problem at hand is retrieving values from different time-series at the same point in time (see "Table design for multiple time series data").
My proposed overall design is the following:
CREATE TABLE [dbo].[time_series_definition](
[ID] [int] IDENTITY(1,1) NOT NULL,
[data_type_description] [nvarchar](100) NULL,
[duration_sec] [int] NOT NULL,
CONSTRAINT [PK_time_series_definition] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
CREATE TABLE [dbo].[time_series](
[ID] [int] IDENTITY(1,1) NOT NULL,
[start_date] [date] NOT NULL,
[end_date] [date] NOT NULL,
[time_series_definition_ID] [int] NOT NULL,
[source] [nchar](30) NULL,
[description] [nvarchar](100) NULL,
[update_time] [datetime2](0) NOT NULL,
CONSTRAINT [PK_time_series] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
ALTER TABLE [dbo].[time_series] WITH CHECK ADD CONSTRAINT [FK_time_series_time_series_definition] FOREIGN KEY([time_series_definition_ID])
REFERENCES [dbo].[time_series_definition] ([ID])
CREATE TABLE [dbo].[data_values](
[ID] [int] IDENTITY(1,1) NOT NULL,
[date_time] [datetime2](0) NOT NULL,
[time_series_ID] [int] NOT NULL,
[value] [decimal](19, 8) NULL,
CONSTRAINT [PK_data_values] PRIMARY KEY CLUSTERED
(
[ID] ASC
))
ALTER TABLE [dbo].[data_values] WITH CHECK ADD CONSTRAINT [FK_data_values_time_series] FOREIGN KEY([time_series_ID])
REFERENCES [dbo].[time_series] ([ID])
The values [start_date] and [end_date] are redundant, but I believe that they might improve query speed when the start/end of the series is known prior to the lookup in the [data_values] table.
The [duration_sec] column is there to save space in the [data_values] table, since the values are evenly spaced within a specific series.
So, given this design, what is the best index/partition strategy to enable fast lookup of different series at a given time or time interval?
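To make the access pattern concrete, here is a minimal sketch of the kind of lookup described above, together with one candidate index for it. The index name, the series IDs and the time window are hypothetical, and the index is an untested assumption rather than a measured recommendation.

```sql
-- Candidate index (an assumption): seek by series, then range-scan by time.
-- INCLUDE ([value]) lets the query below be answered from the index alone.
CREATE NONCLUSTERED INDEX [IX_data_values_series_time]
    ON [dbo].[data_values] ([time_series_ID], [date_time])
    INCLUDE ([value]);
GO

-- Hypothetical series IDs and time window, for illustration only.
DECLARE @from datetime2(0) = '2015-01-01 00:00:00';
DECLARE @to   datetime2(0) = '2015-01-02 00:00:00';

SELECT dv.[time_series_ID], dv.[date_time], dv.[value]
FROM [dbo].[data_values] AS dv
WHERE dv.[time_series_ID] IN (1, 2, 3)
  AND dv.[date_time] >= @from
  AND dv.[date_time] <  @to
ORDER BY dv.[date_time], dv.[time_series_ID];
```

Whether this ordering of key columns beats the alternative ([date_time] first) depends on how many series a typical query touches, so it would need to be checked against real data.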

I do not know what's wrong with my database. Displays errors

I created two tables, and I need two Ids in the second table, Friends. UserId is the Id from the UserInformation table. FriendId is an Id that is already in this table. I need to create a relationship between these two tables and set up the FK and PK correctly. I tried to do it myself, but SQL Server reports an error at this part: REFERENCES [UserInformation] (UserId)
I need your help to do the job correctly.
CREATE TABLE [dbo].[UserInformation]
(
[Id] [INT] IDENTITY(1,1) NOT NULL,
[Login] [VARCHAR](50) NOT NULL,
[Password] [VARCHAR](50) NOT NULL,
[FirstName] [NCHAR](10) NOT NULL,
[LastName] [NCHAR](10) NOT NULL,
[Email] [VARCHAR](50) NOT NULL,
[RegistrationDate] [DATETIME] NOT NULL,
[Groups] [VARCHAR](50) NOT NULL
)
GO
CREATE TABLE [dbo].[Friends]
(
[UserId] [INT] NOT NULL,
[FriendId] [INT] NOT NULL,
PRIMARY KEY (FriendId),
CONSTRAINT FK_UserInformationFriend
FOREIGN KEY (UserId) REFERENCES [UserInformation](UserId)
)
GO
ALTER TABLE [dbo].[UserInformation]
ADD CONSTRAINT [DF_UserInformation_RegistrationDate]
DEFAULT (GETDATE()) FOR [RegistrationDate]
GO
ALTER TABLE UserInformation
ADD CONSTRAINT DF_UserInformation_Login_Unique UNIQUE(Login)
GO
ALTER TABLE UserInformation
ADD CONSTRAINT DF_UserInformation_Email_Unique UNIQUE(Email)
GO
ALTER TABLE UserInformation
ADD CONSTRAINT [PK_UserInformation] PRIMARY KEY ([Id])
GO
ALTER TABLE Friends
ADD CONSTRAINT [PK_Friends] PRIMARY KEY ([UserId])
GO
First, change it to:
CREATE TABLE [dbo].[UserInformation](
[Id] [int] IDENTITY(1,1) PRIMARY KEY NOT NULL,
[Login] [varchar](50) NOT NULL,
[Password] [varchar](50) NOT NULL,
[FirstName] [nchar](10) NOT NULL,
[LastName] [nchar](10) NOT NULL,
[Email] [varchar](50) NOT NULL,
[RegistrationDate] [datetime] NOT NULL,
[Groups] [varchar](50) NOT NULL
)
then:
CREATE TABLE [dbo].[Friends](
[UserId] [int] NOT NULL,
[FriendId] [int] NOT NULL,
PRIMARY KEY (FriendId),
CONSTRAINT FK_UserInformationFriend FOREIGN KEY (UserId)
REFERENCES [UserInformation](Id)
)
and last:
ALTER TABLE [dbo].[UserInformation]
ADD CONSTRAINT [DF_UserInformation_RegistrationDate] DEFAULT
(getdate()) FOR [RegistrationDate]
GO
ALTER TABLE UserInformation
ADD CONSTRAINT DF_UserInformation_Login_Unique UNIQUE (Login)
GO
ALTER TABLE UserInformation
ADD CONSTRAINT DF_UserInformation_Email_Unique UNIQUE (Email)
You don't need these, because a primary key is already set on both tables:
ALTER TABLE UserInformation
ADD CONSTRAINT [PK_UserInformation] PRIMARY KEY ([Id])
GO
ALTER TABLE Friends
ADD CONSTRAINT [PK_Friends] PRIMARY KEY ([UserId])
GO
Note: a table can have only one primary key. If you need the key to span several columns, use a composite primary key:
primary key (FriendId, UserId)
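Putting the composite key into a full definition might look like the sketch below. It assumes that FriendId, like UserId, refers to a row in UserInformation, which the question does not state explicitly. The constraint names and the CHECK are illustrative additions.

```sql
CREATE TABLE [dbo].[Friends]
(
    [UserId]   INT NOT NULL,
    [FriendId] INT NOT NULL,
    -- One row per (user, friend) pair; a user can have many friends.
    CONSTRAINT [PK_Friends] PRIMARY KEY ([UserId], [FriendId]),
    CONSTRAINT [FK_Friends_User]
        FOREIGN KEY ([UserId]) REFERENCES [dbo].[UserInformation] ([Id]),
    -- Assumption: a friend is also a row in UserInformation.
    CONSTRAINT [FK_Friends_Friend]
        FOREIGN KEY ([FriendId]) REFERENCES [dbo].[UserInformation] ([Id]),
    -- Illustrative extra rule: a user cannot befriend themselves.
    CONSTRAINT [CK_Friends_NotSelf] CHECK ([UserId] <> [FriendId])
);
```

Both foreign keys require [UserInformation].[Id] to be a primary key or have a unique constraint, which the corrected CREATE TABLE above provides.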

SQL Server database design for high volume stock market price data

I am writing an application to store and retrieve stock market price data, which is inserted on a daily basis. I am storing the data for each asset (stock) and for most of the markets in the world. This is my current design of the tables:
Country table:
CREATE TABLE [dbo].[List_Country]
(
[CountryId] [char](2) NOT NULL,
[Name] [nvarchar](100) NOT NULL,
[CurrenyCode] [nvarchar](5) NULL,
[CurrencyName] [nvarchar](50) NULL,
CONSTRAINT [PK_dbo.List_Country]
PRIMARY KEY CLUSTERED ([CountryId] ASC)
)
Asset table:
CREATE TABLE [dbo].[List_Asset]
(
[AssetId] [int] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](max) NOT NULL,
[CountryId] [char](2) NOT NULL,
CONSTRAINT [PK_dbo.List_Asset]
PRIMARY KEY CLUSTERED ([AssetId] ASC)
)
Foreign key constraint on Country:
ALTER TABLE [dbo].[List_Asset] WITH CHECK
ADD CONSTRAINT [FK_dbo.List_Asset_dbo.List_Country_CountryId]
FOREIGN KEY([CountryId])
REFERENCES [dbo].[List_Country] ([CountryId])
ON DELETE CASCADE
GO
Stock_Price table:
CREATE TABLE [dbo].[Stock_Price_Data]
(
[StockPriceDataId] [int] IDENTITY(1,1) NOT NULL,
[AssetId] [int] NOT NULL,
[PriceDate] [datetime] NOT NULL,
[Open] [int] NOT NULL,
[High] [int] NOT NULL,
[Low] [int] NOT NULL,
[Close] [int] NOT NULL,
[Volume] [int] NOT NULL,
CONSTRAINT [PK_dbo.Stock_Price_Data]
PRIMARY KEY CLUSTERED ([StockPriceDataId] ASC)
)
Foreign key constraint on Asset:
ALTER TABLE [dbo].[Stock_Price_Data] WITH CHECK
ADD CONSTRAINT [FK_dbo.Stock_Price_Data_dbo.List_Asset_AssetId]
FOREIGN KEY([AssetId])
REFERENCES [dbo].[List_Asset] ([AssetId])
ON DELETE CASCADE
My concern at the moment is that the Stock_Price_Data table will be filled with a high volume of rows. For a specific market in a country, there can easily be 20,000 assets. Thus, in a year (260 days of trading), I could potentially have 5.2 million rows for each country.
The application does not restrict a user from accessing data for countries other than the default country (which is set up during login).
Is it a good idea to have a separate table (i.e. Stock_Price_Data_AU) for each country? Or is there a better way to design the database for the above scenario?
-Alan-
First of all, I'd drop the _data from the table name; it's overkill.
If you are reasonably certain that users will always filter the data by country, i.e. look at only one country at a time, then I'd consider partitioning the table by CountryId. That way SQL Server can use partition elimination to read only the relevant data: you get the ease of maintenance of one table with performance close to that of a separate table per country. (I'm assuming you have Enterprise Edition.) If your load also works on a per-country basis, you can even switch out the partition and drop the indexes to get even faster loads.
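A minimal sketch of that partitioning idea follows. It rests on several assumptions: the price table gains its own [CountryId] column (the original only reaches the country through List_Asset), there is one row per asset per day, and the boundary values, object names and filegroup placement are hypothetical.

```sql
-- With RANGE RIGHT, each listed code starts a new partition.
-- Codes that sort between two boundaries (e.g. 'CA') share the lower partition.
CREATE PARTITION FUNCTION pf_Country (char(2))
    AS RANGE RIGHT FOR VALUES ('AU', 'GB', 'US');
GO
-- All partitions on one filegroup, to keep the sketch simple.
CREATE PARTITION SCHEME ps_Country
    AS PARTITION pf_Country ALL TO ([PRIMARY]);
GO
CREATE TABLE [dbo].[Stock_Price]
(
    [CountryId] char(2)  NOT NULL,  -- assumed extra column, needed as partitioning key
    [AssetId]   int      NOT NULL,
    [PriceDate] datetime NOT NULL,
    [Open]      int      NOT NULL,
    [High]      int      NOT NULL,
    [Low]       int      NOT NULL,
    [Close]     int      NOT NULL,
    [Volume]    int      NOT NULL,
    -- The partitioning column must be part of the clustered primary key.
    CONSTRAINT [PK_Stock_Price]
        PRIMARY KEY CLUSTERED ([CountryId], [AssetId], [PriceDate])
) ON ps_Country ([CountryId]);
GO
-- Filtering on the partitioning column allows partition elimination.
SELECT [PriceDate], [Close]
FROM [dbo].[Stock_Price]
WHERE [CountryId] = 'AU'
  AND [AssetId] = 42            -- hypothetical asset
  AND [PriceDate] >= '20150101';
```

Table partitioning required Enterprise Edition before SQL Server 2016 SP1; from that release on it is available in the other editions as well.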

Need advice on table relations

I have a table Users:
[UserId] [int] IDENTITY(1,1) NOT NULL,
[UserName] [nvarchar](20) NOT NULL,
[Email] [nvarchar](100) NOT NULL,
[Password] [nvarchar](128) NOT NULL,
[PasswordSalt] [nvarchar](128) NOT NULL,
[Comments] [nvarchar](256) NULL,
[CreatedDate] [datetime] NOT NULL,
[LastModifiedDate] [datetime] NULL,
[LastLoginDate] [datetime] NOT NULL,
[LastLoginIp] [nvarchar](40) NULL,
[IsActivated] [bit] NOT NULL,
[IsLockedOut] [bit] NOT NULL,
[LastLockedOutDate] [datetime] NOT NULL,
[LastLockedOutReason] [nvarchar](256) NULL,
[NewPasswordKey] [nvarchar](128) NULL,
[NewPasswordRequested] [datetime] NULL,
[NewEmail] [nvarchar](100) NULL,
[NewEmailKey] [nvarchar](128) NULL,
[NewEmailRequested] [datetime] NULL
This table has a 1:1 relation to Profiles:
[UserId] [int] NOT NULL,
[FirstName] [nvarchar](25) NULL,
[LastName] [nvarchar](25) NULL,
[Sex] [bit] NULL,
[BirthDay] [smalldatetime] NULL,
[MartialStatus] [int] NULL
I need to connect users to all the other tables in the database, so is it better to:
1) Make relations from Users - to other tables?
2) Make relations from Profiles - to other tables?
Since the table [Users] contains the Identity value and is therefore where the [UserID] value originates, I would create all the foreign keys back to it. From a performance standpoint, assuming you have your clustered index on both tables set on the [UserID] column there should be very little performance impact.
Technically I suppose the [Users] table could contain more data per row and therefore the index could span more pages and you could have milliseconds difference in lookups, but I think it makes more sense to relate it back to the table that created the [UserID] and is similarly named. That said, you can really do either.
If the PK of Profiles is a FK to Users, I would maintain consistency and use Users as the parent table in other relationships across the database.
However, if it is a true one-to-one and not a one-to-zero-or-one relationship, it doesn't matter.
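As a sketch of using Users as the parent table, assuming [Users] has its primary key on [UserId] and neither table has these constraints yet, the keys might be declared as follows. The [Posts] table and all constraint names are hypothetical.

```sql
-- Profiles shares its key with Users: the PK is also an FK.
ALTER TABLE [dbo].[Profiles]
    ADD CONSTRAINT [PK_Profiles] PRIMARY KEY CLUSTERED ([UserId]);

ALTER TABLE [dbo].[Profiles]
    ADD CONSTRAINT [FK_Profiles_Users]
        FOREIGN KEY ([UserId]) REFERENCES [dbo].[Users] ([UserId]);

-- Hypothetical child table that relates back to Users, not Profiles.
CREATE TABLE [dbo].[Posts]
(
    [PostId] INT IDENTITY(1,1) NOT NULL,
    [UserId] INT NOT NULL,
    [Body]   NVARCHAR(MAX) NOT NULL,
    CONSTRAINT [PK_Posts] PRIMARY KEY CLUSTERED ([PostId]),
    CONSTRAINT [FK_Posts_Users]
        FOREIGN KEY ([UserId]) REFERENCES [dbo].[Users] ([UserId])
);
```

Because Profiles.UserId carries the same value, a query can still join Posts directly to Profiles on [UserId] without going through Users.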
Another consideration is how the data in this database is accessed by any applications. Do the applications use an OR/M like Entity Framework which is aware of FK relationships? If so, consider using whichever table has columns which will most commonly be accessed by queries based on the child tables. For example, an application might display Profiles.LastName and Profiles.FirstName all over the place and very rarely read anything from the Users table. In this situation, you will save your database some I/O and save your developers some keystrokes by building relationships off the Profiles table.

Unique row constraint in SQL Server

I have the following table
CREATE TABLE [dbo].[LogFiles_Warehouse](
[id] [int] IDENTITY(1,1) NOT NULL,
[timestamp] [datetime] NOT NULL,
[clientNr] [int] NOT NULL,
[server] [nvarchar](150) COLLATE Latin1_General_CI_AS NOT NULL,
[storeNr] [int] NOT NULL,
[account] [nvarchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
[software] [nvarchar](300) COLLATE Latin1_General_CI_AS NOT NULL,
CONSTRAINT [PK_Astoria_LogFiles_Warehouse] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, IGNORE_DUP_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
I want to avoid having duplicate rows in my table. I thought about creating a UNIQUE index on the complete table, but then SQL Server Management Studio tells me that this is not possible because the key would be too large.
Is there another way I could enforce unique rows over all columns, apart from indexes?
Create a UNIQUE index on hashed values:
CREATE TABLE [dbo].[LogFiles_Warehouse]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[timestamp] [datetime] NOT NULL,
[clientNr] [int] NOT NULL,
[server] [nvarchar](150) COLLATE Latin1_General_CI_AS NOT NULL,
[storeNr] [int] NOT NULL,
[account] [nvarchar](50) COLLATE Latin1_General_CI_AS NOT NULL,
[software] [nvarchar](300) COLLATE Latin1_General_CI_AS NOT NULL,
serverHash AS CAST(HASHBYTES('MD4', server) AS BINARY(16)),
accountHash AS CAST(HASHBYTES('MD4', account) AS BINARY(16)),
softwareHash AS CAST(HASHBYTES('MD4', software) AS BINARY(16))
)
CREATE UNIQUE INDEX
UX_LogFilesWarehouse_Server_Account_Software
ON LogFiles_Warehouse (serverHash, accountHash, softwareHash)
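The index above covers only the three hashed columns, so two rows that differ only in timestamp, clientNr or storeNr would still be rejected. If uniqueness should span every column except id, as the question asks, a variant along these lines might be closer. The index name is hypothetical, and it would replace the three-column index rather than sit beside it.

```sql
-- Key size: 8 (datetime) + 4 + 4 (ints) + 3 * 16 (hashes) = 64 bytes,
-- well under the index key size limit.
CREATE UNIQUE INDEX UX_LogFilesWarehouse_AllColumns
    ON dbo.LogFiles_Warehouse
       ([timestamp], clientNr, storeNr, serverHash, accountHash, softwareHash);
```

Two caveats apply. HASHBYTES works on the raw bytes, so values that the case-insensitive collation treats as equal (for example 'ABC' and 'abc') hash differently and would not count as duplicates. A hash collision, although unlikely, would also reject a row that is not a true duplicate.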
Use triggers plus a smaller non-unique index over the most distinguishing fields to help alleviate the table scan problem.
This largely comes down to a bad database design in the first place. Fields like software and account do not belong in that table (or, if account does, then clientNr does not). Your table is only this wide because you already violate database design basics.
Also, to avoid duplicate rows, you must NOT include the id field in the uniqueness test; otherwise you will never have duplicates in the first place.
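A minimal sketch of the trigger approach follows, assuming the original table definition from the question. The trigger and index names are hypothetical, as is the choice of ([timestamp], clientNr, storeNr) as the most distinguishing fields.

```sql
-- Small non-unique index so the duplicate check can seek instead of scan.
CREATE NONCLUSTERED INDEX IX_LogFilesWarehouse_Lookup
    ON dbo.LogFiles_Warehouse ([timestamp], clientNr, storeNr);
GO
CREATE TRIGGER dbo.trg_LogFilesWarehouse_NoDuplicates
ON dbo.LogFiles_Warehouse
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Compare every column except id; a match with a different id is a duplicate.
    IF EXISTS (
        SELECT 1
        FROM dbo.LogFiles_Warehouse AS w
        JOIN inserted AS i
          ON  w.[timestamp] = i.[timestamp]
          AND w.clientNr    = i.clientNr
          AND w.storeNr     = i.storeNr
          AND w.[server]    = i.[server]
          AND w.account     = i.account
          AND w.software    = i.software
          AND w.id         <> i.id
    )
    BEGIN
        RAISERROR('Duplicate row in LogFiles_Warehouse.', 16, 1);
        ROLLBACK TRANSACTION;
        RETURN;
    END
END;
GO
```

Unlike a unique index, this check is not safe against two sessions inserting the same row at the same moment under the default isolation level, so it is best treated as a sketch of the idea rather than a complete guarantee.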

Resources