JOIN Performance: Composite key versus BigInt Primary Key - sql-server

We have a table that is going to be say 100 million to a billion rows (Table name: Archive)
This table will be referenced from another table, Users.
We have 2 options for the primary key on the Archive table:
option 1: dataID (bigint)
option 2: userID + datetime (the 4-byte smalldatetime version).
Schema:
Users
- userID (int)
Archive
- userID
- datetime
OR
Archive
- dataID (big int)
Which one would be faster?
We are shying away from using Option 1 because bigint is 8 bytes, and with 100 million rows that will add up to a lot of storage.
Update
OK, sorry, I forgot to mention: userID and datetime have to be stored regardless, so that was the reason for not adding another column, dataID, to the table.

Some thoughts, but there is probably not a clear cut solution:
If you have a billion rows, why not use int, which goes from -2.1 billion to +2.1 billion?
Userid, int, 4 bytes + smalldatetime, 4 bytes = 8 bytes, same as bigint
If you are thinking of userid + smalldatetime, then surely that combination is useful anyway.
If so, adding a surrogate "archiveID" column will increase space anyway
Do you require filtering/sorting by userid + smalldatetime?
Make sure your model is correct, worry about JOINs later...

Concern: Using UserID/[small]datetime carries with it a high risk of not being unique.
Here is some real schema. Is this what you're talking about?
-- Users (regardless of Archive choice)
CREATE TABLE dbo.Users (
userID int NOT NULL IDENTITY,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID)
)
-- Archive option 1
CREATE TABLE dbo.Archive (
dataID bigint NOT NULL IDENTITY,
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (dataID)
)
-- Archive option 2
CREATE TABLE dbo.Archive (
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC)
)
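-- note: this supporting index matters for option 1 (userID/[datetime] lookups);
-- under option 2 it would just duplicate the clustered key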
CREATE NONCLUSTERED INDEX <name> ON dbo.Archive (
userID,
[datetime] DESC
)
If this were my decision, I would definitely go with option 1. Disk is cheap.
If you go with Option 2, it's likely that you will have to add some other column to your PK to make it unique, and then your design starts degrading.
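For example, smalldatetime has only one-minute precision, so one user generating two archive rows in the same minute breaks the Option 2 key. A hypothetical sketch of the failure and the usual patch (the seqNo tiebreaker column is invented):
-- two rows for the same user in the same minute collide under option 2
INSERT INTO dbo.Archive (userID, [datetime]) VALUES (1, '20090115 10:30');
INSERT INTO dbo.Archive (userID, [datetime]) VALUES (1, '20090115 10:30');
-- the second insert fails with a PRIMARY KEY violation

-- the usual patch, and the start of the degradation: widen the key
CREATE TABLE dbo.Archive (
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
seqNo int NOT NULL, -- hypothetical tiebreaker
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC, seqNo)
)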

What about option 3: making dataID a 4-byte int?
Also, if I understand it right, the archive table will be referenced from the users table, so it wouldn't even make much sense to have the userID in the archive table.

I recommend that you set up a simulation to validate this in your environment, but my guess would be that the single bigint would be faster in general; however, when you query the table, what are you going to be querying on?
If I were building an archive, I might lean toward having an autoincrement identity field, and then using a partitioning scheme to partition based on DateTime and perhaps userID, but that would depend on the circumstances.
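A rough sketch of that kind of partitioning, with invented names and monthly boundaries; note that the partitioning column has to be part of the clustered key for the unique index to be partition-aligned:
CREATE PARTITION FUNCTION pfArchiveMonthly (smalldatetime)
AS RANGE RIGHT FOR VALUES ('20090101', '20090201', '20090301');
CREATE PARTITION SCHEME psArchiveMonthly
AS PARTITION pfArchiveMonthly ALL TO ([PRIMARY]);
CREATE TABLE dbo.Archive (
dataID bigint NOT NULL IDENTITY,
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
-- [datetime] is in the key so the unique clustered index can be partition-aligned
CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (dataID, [datetime])
) ON psArchiveMonthly ([datetime]);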

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance, mostly for selecting data but also for inserts.
IndicatorValue
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT NULL,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch-inserted 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (I didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have a PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
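For what it's worth, the layout the question sketches out might look like the following; this is only a sketch, assuming the table lives in dbo and that (IndicatorId, Interval, UnixTime) is unique, with UnixTime last in the key so the ORDER BY can be satisfied without a sort:
-- clustered PK matching the WHERE columns, ending with UnixTime for the ORDER BY
ALTER TABLE dbo.IndicatorValue
ADD CONSTRAINT PK_IndicatorValue
PRIMARY KEY CLUSTERED ([IndicatorId], [Interval], [UnixTime]);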

Dynamic SQL to execute large number of rows from a table

I have a table with a very large number of rows, each holding a SQL statement that I wish to execute via dynamic SQL. They are basically existence checks and insert statements, and I want to migrate data from one production database to another - we are merging transactional data. I am trying to find the optimal way to execute the rows.
I've found the coalesce method of appending all the rows to one another to be inefficient for this, particularly when the number of rows executed at a time is greater than ~100.
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [int] NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is I am tracking the source table PK vs the target PK and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method is executing some dynamic SQL generated by a stored procedure. The stored proc does PK lookups, as the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
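A sketch of the HASHBYTES idea, joining through the ChangeTracking table described above (the delimiter and hash algorithm are assumptions, and [FK_ID1] is left out because the surrogate IDs differ between the systems):
-- compare per-row hashes of already-migrated rows to spot drift
-- note: before SQL Server 2016, HASHBYTES input is capped at 8000 bytes,
-- which matters with NVARCHAR(MAX) columns like [LotsMoreFields]
SELECT [S].[ID] AS [Src_ID], [CT].[Dest_ID]
FROM [Src].[dbo].[MyTable] [S]
JOIN [Dest].[dbo].[ChangeTracking] [CT]
ON [CT].[Src_ID] = [S].[ID] AND [CT].[TableName] = N'MyTable'
JOIN [Dest].[dbo].[MyTable] [D]
ON [D].[ID] = [CT].[Dest_ID]
WHERE HASHBYTES('SHA2_256', CONCAT([S].[DataField1], N'|', [S].[LotsMoreFields]))
<> HASHBYTES('SHA2_256', CONCAT([D].[DataField1], N'|', [D].[LotsMoreFields]));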
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be way more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from RusselFox and putting them together, I reached this tentative solution (but it's looking a LOT more efficient):
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] AS [SrcID], [MT_S].[DataField1], [FK1_D].[ID] AS [FK_ID1], [MT_S].[LotsMoreFields]
FROM [Src].[dbo].[MyTable] [MT_S]
JOIN [Src].[dbo].[FK1] [FK1_S] ON [MT_S].[FK_ID1] = [FK1_S].[ID]
JOIN [Dest].[dbo].[FK1] [FK1_D] ON [FK1_S].[Name] = [FK1_D].[Name]
) [SRC] ON 1 = 0 -- always false, so every source row falls into WHEN NOT MATCHED
WHEN NOT MATCHED THEN
INSERT ([DataField1],[FK_ID1],[LotsMoreFields])
VALUES ([SRC].[DataField1],[SRC].[FK_ID1],[SRC].[LotsMoreFields])
OUTPUT [SRC].[SrcID],INSERTED.[ID],N'MyTable' INTO [Dest].[dbo].[ChangeTracking]([Src_ID],[Dest_ID],[TableName]);

Recommended SQL Server table design for file import and processing

I have a scenario where files will be uploaded into a database table (dbo.FileImport) with each line of the file in a new row. Each row will contain the line data and the name of the file it came from. The file names are unique, but a single file may contain a few million lines. Multiple files' data may exist in the table at one time.
Each file is processed and the results are stored in a separate table. After processing the data related to the file, the data is deleted from the import table to keep the table from growing indefinitely.
The table structure is as follows:
CREATE TABLE [dbo].[FileImport] (
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL
);
During the processing the data for the relevant file is loaded with the following query:
SELECT [LineData] FROM [dbo].[FileImport] WHERE [FileName] = @FileName
And then deleted with the following statement:
DELETE FROM [dbo].[FileImport] WHERE [FileName] = @FileName
My question is pertaining to the table design with regard to performance and longevity...
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
Should I add a PRIMARY KEY Constraint to the [Id] column?
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Any advice or thoughts would be much appreciated. Opinionated designs are welcome ;-)
Update 2017-12-10
I failed to mention that the lines of a file may not be unique. So please take this into account if this affects the recommendation.
An example script in the answer would be an added bonus! ;-)
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
It is not necessary to have an unused column. This is not a relational table and will not be referenced by a foreign key, so one could make the argument that a primary key is unnecessary.
I would not be concerned about running out of 64-bit integer values. bigint can hold a positive value of up to 9,223,372,036,854,775,807; it would take centuries to run out of values even at 1 billion rows per second.
Should I add a PRIMARY KEY Constraint to the [Id] column?
I would create a composite clustered primary key on FileName and ID. That would provide an incremental value to facilitate retrieving rows in the order of insertion and the FileName leftmost key column would benefit your queries greatly.
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
See above.
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
No. Assuming you query by FileName, only the rows requested will be touched with the suggested primary key.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Incremental keys avoid fragmentation.
EDIT:
Here's the suggested DDL for the table:
CREATE TABLE dbo.FileImport (
FileName VARCHAR (100) NOT NULL
, RecordNumber BIGINT NOT NULL IDENTITY
, LineData NVARCHAR (300) NOT NULL
, CONSTRAINT PK_FileImport PRIMARY KEY CLUSTERED(FileName, RecordNumber)
);
Here is a rough sketch of how I would do it:
CREATE TABLE [FileImport].[FileName] (
[FileId] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL
);
go
alter table [FileImport].[FileName]
add constraint pk_FileName primary key nonclustered (FileId)
go
create clustered index cix_FileName on [FileImport].[FileName]([FileName])
go
CREATE TABLE [FileImport].[LineData] (
[FileId] BIGINT NOT NULL,
[LineDataId] BIGINT IDENTITY (1, 1) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL,
constraint fk_LineData_FileName foreign key (FileId) references [FileImport].[FileName](FileId)
);
alter table [FileImport].[LineData]
add constraint pk_LineData primary key clustered (FileId, LineDataId)
go
This is with some normalization so you don't have to repeat the full file name on every row. You probably don't have to do it (if you prefer not to, just move FileName into the second table instead of FileId and cluster your index on (FileName, LineDataId)), but since we are using a relational database ...
No need for any additional indexes - tables are sorted by the right keys
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
If your data means anything to you, don't use it. As a matter of fact, if you feel you have to use it, something is really wrong with your DB architecture. The way it is indexed, SQL Server will use a Seek operation, which is very fast.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
You can set up a maintenance job that rebuilds your indexes, and run it nightly with Agent (or whatever).
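For example, a minimal sketch (thresholds and scheduling are up to you):
-- check fragmentation first
SELECT [index_id], [avg_fragmentation_in_percent]
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('FileImport.LineData'), NULL, NULL, 'LIMITED');
-- then rebuild in the nightly job
ALTER INDEX ALL ON [FileImport].[LineData] REBUILD;
ALTER INDEX ALL ON [FileImport].[FileName] REBUILD;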

SQL Server Unique Key Constraint Date/Time Span

How can I set a unique key constraint for the following table to ensure the date/time span between the Date/BeginTime and Date/EndTime do not overlap with another record? If I need to add a computed column, what data type and calculation?
Column Name Data Type
Date date
BeginTime time(7)
EndTime time(7)
Thanks.
I don't believe that you can do that using a UNIQUE constraint in SQL Server. Postgres has this capability, but to implement it in SQL Server you must use a trigger. Since your question was "how can I do this using a unique key constraint", the correct answer is "you can't". If you had asked "how can I enforce this non-overlapping constraint", there is an answer.
Alexander Kuznetsov shows one possible way. Storing intervals of time with no overlaps.
See also article by Joe Celko: Contiguous Time Periods
Here is the table and the first interval:
CREATE TABLE dbo.IntegerSettings(SettingID INT NOT NULL,
IntValue INT NOT NULL,
StartedAt DATETIME NOT NULL,
FinishedAt DATETIME NOT NULL,
PreviousFinishedAt DATETIME NULL,
CONSTRAINT PK_IntegerSettings_SettingID_FinishedAt
PRIMARY KEY(SettingID, FinishedAt),
CONSTRAINT UNQ_IntegerSettings_SettingID_PreviousFinishedAt
UNIQUE(SettingID, PreviousFinishedAt),
CONSTRAINT FK_IntegerSettings_SettingID_PreviousFinishedAt
FOREIGN KEY(SettingID, PreviousFinishedAt)
REFERENCES dbo.IntegerSettings(SettingID, FinishedAt),
CONSTRAINT CHK_IntegerSettings_PreviousFinishedAt_NotAfter_StartedAt
CHECK(PreviousFinishedAt <= StartedAt),
CONSTRAINT CHK_IntegerSettings_StartedAt_Before_FinishedAt
CHECK(StartedAt < FinishedAt)
);
INSERT INTO dbo.IntegerSettings
(SettingID, IntValue, StartedAt, FinishedAt, PreviousFinishedAt)
VALUES(1, 1, '20070101', '20070103', NULL);
Constraints enforce these rules:
There can be only one first interval for a setting
Next window must begin after the end of the previous one
Two different windows cannot refer to one and the same window as their previous one
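Following those rules, chaining a second interval onto the first would look like this (a sketch derived from the constraints above):
-- PreviousFinishedAt must match an existing FinishedAt for this setting,
-- and the new interval must start no earlier than that point
INSERT INTO dbo.IntegerSettings
(SettingID, IntValue, StartedAt, FinishedAt, PreviousFinishedAt)
VALUES(1, 2, '20070103', '20070105', '20070103');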
-- this is a unique key that allows for NULL in the EndTime field
-- this unique index could optionally be clustered instead of the traditional primary key being clustered
CREATE UNIQUE NONCLUSTERED INDEX
[UNQ_IDX_Date_BeginTm_EndTm_UniqueIndex_With_Null_EndTime] ON [MyTableName]
(
[Date] ASC,
[BeginTime] ASC,
[EndTime] ASC
)
GO
-- this is a traditional PK constraint that is clustered, but EndTime is NOT NULL
-- it is possible that this table would not have a traditional primary key
ALTER TABLE dbo.MyTable ADD CONSTRAINT
PK_Date_BeginTm_EndTm_EndTimeIsNotNull PRIMARY KEY CLUSTERED
(
Date,
BeginTime,
EndTime
)
GO
-- HINT - control your BeginTime and EndTime seconds and milliseconds at
-- all insert and read points;
-- you want 13:01:42.000 and 13:01:42.333 to evaluate and compare the
-- exact way you expect from a KEY perspective
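One hypothetical way to do that normalization on the way in, assuming whole-second granularity is acceptable (dbo.MyTable and the values are invented):
-- CONVERT to time(0) rounds away the fractional seconds, so stored keys
-- compare predictably even though the columns stay time(7)
INSERT INTO dbo.MyTable ([Date], [BeginTime], [EndTime])
VALUES ('20240101', CONVERT(time(0), '13:01:42.333'), CONVERT(time(0), '14:00:00'));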

What's the difference between primary key in table definitions vs. unique clustered index

What is the difference between defining the PK as part of the table definition vs. adding it afterwards as a unique clustered index? Using the example below, both tables show up with index_id 1 in sys.indexes, but only Table1 has is_primary_key = 1.
I thought these were the same, but SSMS only shows the key symbol on Table1.
Thanks.
CREATE DATABASE IndexVsHeap
GO
USE [IndexVsHeap]
GO
-- Clustered index table
CREATE TABLE [dbo].[Table1](
[LogDate] [datetime2](7) NOT NULL,
[Database_Name] [nvarchar](128) NOT NULL,
[Cached_Size_MB] [decimal](10, 2) NULL,
[Buffer_Pool_Percent] [decimal](5, 2) NULL,
CONSTRAINT [PK_LogDate_DatabaseName] PRIMARY KEY(LogDate, Database_Name)
)
-- Table as heap, PK-CI added later. Or did I?
CREATE TABLE [dbo].[Table2](
[LogDate] [datetime2](7) NOT NULL,
[Database_Name] [nvarchar](128) NOT NULL,
[Cached_Size_MB] [decimal](10, 2) NULL,
[Buffer_Pool_Percent] [decimal](5, 2) NULL
)
-- Adding PK-CI to table2
CREATE UNIQUE CLUSTERED INDEX [PK_LogDate_Database_Name] ON [dbo].[Table2]
(
[LogDate] ASC,
[Database_Name] ASC
)
GO
SELECT object_name(object_id), * FROM sys.index_columns
WHERE object_id IN ( object_id('table1'), object_id('table2') )
SELECT * FROM sys.indexes
WHERE name LIKE '%PK_LogDate%'
To all intents and purposes there is no difference here.
A unique index would allow NULLs, but the columns are NOT NULL anyway.
Also, a unique index (though not a unique constraint) could be declared with included columns or as a filtered index, but neither of those applies here as the index is clustered.
The primary key creates a named constraint object that is schema scoped so the name must be unique. An index must only be named uniquely within the table it is part of.
I would still opt for the PK, though, to get the visual indicator in the tooling. It allows other developers (and possibly code) to more easily detect what the unique row identifier is.
Also remember that while a table can have only one PK, it could have multiple unique indexes (although only one can be clustered).
I can see where you might want to cluster on information that is unique in some meaningful way but might want to have a separate autogenerated nonclustered PK to make joins faster than joining on the automobile VIN number, for instance. That is why both are available.
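That pattern might look something like this (an invented sketch; the table and column names are hypothetical):
CREATE TABLE dbo.Vehicle (
VehicleID int NOT NULL IDENTITY
CONSTRAINT PK_Vehicle PRIMARY KEY NONCLUSTERED, -- slim surrogate key for joins
VIN varchar(17) NOT NULL
);
-- cluster on the meaningful natural key instead
CREATE UNIQUE CLUSTERED INDEX CIX_Vehicle_VIN ON dbo.Vehicle (VIN);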
A primary key is a key that identifies each row in a unique way (it is backed by a unique index too). It can be clustered or not, but it's highly recommended to be clustered. If it is clustered, data is stored based on that key.
A unique clustered index enforces a unique value (or combination of values), and the data is stored based on that index.
What's the advantage of a clustered index? If you have to do an index scan (scan the whole index), the data is stored together, so it's faster.
