SQL Server - smart approach for combining GUID and identity column

I am trying to come up with a design for my database where across all my tables I'd like to have the combination of a GUID column (uniqueidentifier data type) and an identity column (int data type).
The GUID column is going to be a NONCLUSTERED index whilst the identity column is going to be the CLUSTERED index. I was wondering if the script below is a correct/safe approach when it comes to database design:
CREATE TABLE country
(
guid uniqueidentifier DEFAULT NEWID() NOT NULL,
code int IDENTITY(1, 1) NOT NULL,
isoCode nvarchar(5) NOT NULL,
description nvarchar(255) NOT NULL,
created date NOT NULL DEFAULT GETDATE(),
updated date NOT NULL DEFAULT GETDATE(),
inactive bit DEFAULT 0
CONSTRAINT NIX_guid PRIMARY KEY NONCLUSTERED(guid),
CONSTRAINT AK_code UNIQUE(code),
CONSTRAINT AK_isoCode UNIQUE(isoCode)
)
GO
CREATE UNIQUE CLUSTERED INDEX [IX_code] ON country ([code] ASC)
GO
Any tips would be much appreciated!

The domain of all possible countries is never going to be more than a few hundred, so performance should not be a concern.
You already have an isoCode. That is a canonically defined candidate key. I understand what you mean when talking about GUIDs being useful because they can never collide when created on separate servers/application instances/etc. But ISO country codes can never collide either, because they're already defined by an external authority. You don't need the GUID.
Why is your existing isoCode column an nvarchar(5)? ISO 3166 defines a 2-letter and a 3-letter code. Neither requires Unicode characters, so you can use char(2) or char(3) depending on which standard you pick, both of which would be narrower than a 4-byte int.
Yes, an identity-based clustered index does mean not having to worry about page splits on insert. But these are countries. We already know all of the countries you need to insert right now, and are you really worried that the handful of changes that might be made over the next few decades will kill the performance of your system due to page splits on insert? No, so you don't need the identity column either.
Eliminate both surrogates, and just go with the alpha-2 or alpha-3 ISO country code as your clustered primary key.
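A minimal sketch of that suggestion, assuming the alpha-2 code is chosen. The constraint name PK_country and the sample row are illustrative and not part of the original script; the remaining columns are carried over from the question unchanged.

```sql
-- Sketch only: the ISO alpha-2 code is both the primary key and the clustering key.
CREATE TABLE country
(
    isoCode     char(2)       NOT NULL,
    description nvarchar(255) NOT NULL,
    created     date          NOT NULL DEFAULT GETDATE(),
    updated     date          NOT NULL DEFAULT GETDATE(),
    inactive    bit           DEFAULT 0,
    CONSTRAINT PK_country PRIMARY KEY CLUSTERED (isoCode)
);
GO

-- Illustrative row; no GUID or identity value has to be generated.
INSERT INTO country (isoCode, description) VALUES ('NZ', N'New Zealand');
```

Tables that reference country would then store the 2-byte code itself, which is readable without a join.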

Related

Convert VARBINARY to Int or BigInt

My question is simple, and I understand that older database designs often fall short of what we expect these days.
My legacy table does not have a primary key that I can use for a delta load, so I'm trying to use hashing to create a unique key. HASHBYTES returns VARBINARY, and I'm not sure whether a VARBINARY column can be used as a primary key.
Ref URL on MSDN:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/94231bb4-ccab-4626-a9fb-325264bb883f/can-varbinary700-column-be-used-as-primary-key?forum=transactsql
Hence, I'm converting the hash to INT or BIGINT. The problem is that the conversion yields both negative and positive values (due to the range).
My question is:
How can I convert a VARBINARY(100) value to a positive INT or BIGINT and set it as the primary key in one of my tables?
Edit Note:
I tried to use VARBINARY as Primary key for Delta load in SSIS Lookup task. I got the error:
"Violation of PRIMARY KEY constraint 'PK__DMIN__607056C02FB7E7DE'. Cannot insert duplicate key in object 'dbo.DMIN_'. The duplicate key value is (0x00001195764c40525bcaf6baa922091696cd8886).".
However, when I checked the table for duplicate keys, I found none. Why, then, is this error showing up?
Please note that the first SSIS execution worked fine. The error only appears during the second execution [during "lookup match output"].
Please help. Thanks.
In projects I've worked on before we've always used GUIDs as our primary keys, utilising the unique identifier type in SQL Server.
The main problem with this however, is that using a uniqueidentifier type as your clustered index can degrade the performance of your database after some time, so recently we've taken the following approach (based on this article):
Create column: guid, uniqueidentifier, nonnull, default value newsequentialid(), PK
Create column: id, bigint, nonnull, identity(1,1)
Create a non clustered index on the guid column, unique
Create a clustered index on the id column, unique
That way when you insert into this new table, you don't have to worry about the keys or identities.
If you need some form of reference between the old database and the new and you CAN modify the structure of the old database, you can create a uniqueidentifier column in that (or char(36) if it doesn't support uniqueidentifier) and assign a guid to each of those and THEN create an additional uniqueidentifier column in the new database so you have that reference and insert that value into it. If that makes sense.
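The four steps listed above can be sketched as follows. The table dbo.Customer, its name column, and all constraint and index names are hypothetical and only serve to make the example self-contained.

```sql
-- Sketch only: dbo.Customer and every constraint/index name here are hypothetical.
CREATE TABLE dbo.Customer
(
    guid uniqueidentifier NOT NULL
        CONSTRAINT DF_Customer_guid DEFAULT NEWSEQUENTIALID(),
    id   bigint IDENTITY(1, 1) NOT NULL,
    name nvarchar(100) NOT NULL,
    -- The primary key enforces uniqueness on guid through a nonclustered index.
    CONSTRAINT PK_Customer PRIMARY KEY NONCLUSTERED (guid)
);
GO

-- The narrow, ever-increasing id column determines the physical row order.
CREATE UNIQUE CLUSTERED INDEX CIX_Customer_id ON dbo.Customer (id);
GO

-- Neither key has to be supplied on insert.
INSERT INTO dbo.Customer (name) VALUES (N'Example');
```

Note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint, which is why it appears there rather than in the INSERT.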

Recommended SQL Server table design for file import and processing

I have a scenario where files will be uploaded into a database table (dbo.FileImport) with each line of the file in a new row. Each row will contain the line data and the name of the file it came from. The file names are unique, but a file may contain a few million lines. Data from multiple files may exist in the table at one time.
Each file is processed and the results are stored in a separate table. After processing the data related to the file, the data is deleted from the import table to keep the table from growing indefinitely.
The table structure is as follows:
CREATE TABLE [dbo].[FileImport] (
[Id] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL
);
During the processing the data for the relevant file is loaded with the following query:
SELECT [LineData] FROM [dbo].[FileImport] WHERE [FileName] = @FileName
And then deleted with the following statement:
DELETE FROM [dbo].[FileImport] WHERE [FileName] = @FileName
My question is pertaining to the table design with regard to performance and longevity...
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
Should I add a PRIMARY KEY Constraint to the [Id] column?
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Any advice or thoughts would be much appreciated. Opinionated designs are welcome ;-)
Update 2017-12-10
I failed to mention that the lines of a file may not be unique. So please take this into account if this affects the recommendation.
An example script in the answer would be an added bonus! ;-)
Is it necessary to have the [Id] column if I never use it (I am concerned about running out of numbers in the Identity eventually too)?
It is not necessary to have an unused column. This is not a relational table and will not be referenced by a foreign key so one could make the argument a primary key is unnecessary.
I would not be concerned about running out of 64-bit integer values. bigint can hold a positive value of up to 9,223,372,036,854,775,807. It would take centuries to run out of values even if you loaded 1 billion rows a second.
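As a quick check of that estimate, the arithmetic can be run directly in T-SQL. The bigint maximum is 2^63 - 1; the variable names and the 365-day year are simplifying assumptions.

```sql
-- Sketch only: how long a bigint identity lasts at 1 billion inserts per second.
DECLARE @MaxBigint      bigint = 9223372036854775807;            -- 2^63 - 1
DECLARE @RowsPerSecond  bigint = 1000000000;                     -- 1 billion rows/s
DECLARE @SecondsPerYear bigint = 60 * 60 * 24 * 365;             -- ignores leap years

SELECT @MaxBigint / @RowsPerSecond                   AS SecondsToExhaust,
       @MaxBigint / @RowsPerSecond / @SecondsPerYear AS YearsToExhaust;
```

With integer division this works out to roughly 292 years, so exhaustion is not a practical concern at any realistic load rate.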
Should I add a PRIMARY KEY Constraint to the [Id] column?
I would create a composite clustered primary key on FileName and ID. That would provide an incremental value to facilitate retrieving rows in the order of insertion and the FileName leftmost key column would benefit your queries greatly.
Should I have a CLUSTERED or NONCLUSTERED index for the [FileName] column?
See above.
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
No. Assuming you query by FileName, only the rows requested will be touched with the suggested primary key.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
Incremental keys avoid fragmentation.
EDIT:
Here's the suggested DDL for the table:
CREATE TABLE dbo.FileImport (
FileName VARCHAR (100) NOT NULL
, RecordNumber BIGINT NOT NULL IDENTITY
, LineData NVARCHAR (300) NOT NULL
, CONSTRAINT PK_FileImport PRIMARY KEY CLUSTERED(FileName, RecordNumber)
);
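For illustration, the load and delete statements from the question could run against this table as sketched below. The file name value is hypothetical, and the ORDER BY is an addition to show what the RecordNumber column is for.

```sql
-- Sketch only: 'example-file.txt' is a hypothetical file name.
DECLARE @FileName varchar(100) = 'example-file.txt';

-- FileName is the leading key column, so this predicate can be satisfied by an index seek.
SELECT LineData
FROM dbo.FileImport
WHERE FileName = @FileName
ORDER BY RecordNumber;   -- identity order within the file

-- Removes only the rows that belong to the processed file.
DELETE FROM dbo.FileImport
WHERE FileName = @FileName;
```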
Here is a rough sketch of how I would do it:
-- Assumes the [FileImport] schema does not exist yet.
create schema [FileImport]
go
CREATE TABLE [FileImport].[FileName] (
[FileId] BIGINT IDENTITY (1, 1) NOT NULL,
[FileName] VARCHAR (100) NOT NULL
);
go
alter table [FileImport].[FileName]
add constraint pk_FileName primary key nonclustered (FileId)
go
create clustered index cix_FileName on [FileImport].[FileName]([FileName])
go
-- FileId must have the same data type as the column it references.
CREATE TABLE [FileImport].[LineData] (
[FileId] BIGINT NOT NULL,
[LineDataId] BIGINT IDENTITY (1, 1) NOT NULL,
[LineData] NVARCHAR (300) NOT NULL,
constraint fk_LineData_FileName foreign key (FileId) references [FileImport].[FileName](FileId)
);
go
-- Constraint names must be unique within the schema, hence pk_LineData.
alter table [FileImport].[LineData]
add constraint pk_LineData primary key clustered (FileId, LineDataId)
go
This includes some normalization so you don't have to store the full file name on every row. You don't have to do it that way: if you prefer, keep FileName in the second table instead of FileId and cluster the index on (FileName, LineDataId). But since we are using a relational database ...
No need for any additional indexes - tables are sorted by the right keys
Should I be making use of NOLOCK whenever I query this table (it is updated very regularly)?
If your data means anything to you, don't use it. As a matter of fact, if you have to use it, something is really wrong with your DB architecture. With the indexes above, SQL Server will use a Seek operation, which is very fast.
Would there be concern of fragmentation with the continual adding and deleting of data to/from this table? If so, how should I handle this?
You can set up a maintenance job that rebuilds your indexes and run it nightly with SQL Server Agent (or whatever scheduler you use).
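A minimal sketch of what such a job step might contain, using the [FileImport].[LineData] table from the sketch above. Checking fragmentation first is an optional addition, and running the rebuild unconditionally every night is an assumption rather than a recommendation for every workload.

```sql
-- Report fragmentation for the indexes on [FileImport].[LineData].
SELECT i.name AS IndexName,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'FileImport.LineData'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id  = ps.index_id;

-- Rebuild every index on the table; an Agent job step could run this statement.
ALTER INDEX ALL ON [FileImport].[LineData] REBUILD;
```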

Is there any reason to avoid using bigint for a surrogate primary key?

I am designing a database and I have a number of tables that could potentially outgrow the maximum size of a standard 32-bit int. However, it will likely be years before this happens and there is no guarantee that it will ever even actually happen.
However, given that there is a chance it could happen, should I go ahead and choose bigint for the primary key? What are the implications of doing it now vs changing it later? Is it even possible to convert an int primary key to a bigint later on, and if so, how difficult is it and is it feasible?
Going BIG will cost you on storage and performance - especially if it means your foreign key references also have to be BIGINT.
Looking "years" ahead isn't necessarily a prudent thing to do. Most (not all) IT projects are expected to recover their costs within 3 years. You will most likely have to contend with plenty of changes and upgrades over the years and if your database has grown so much in that time then it shouldn't be so much effort to change an INT to a BIGINT if and when you need to. By then maybe your business and the database world in general will have moved on and it won't be an issue any more. YAGNI rules.
There really isn't a reason to avoid it per se. It does take more storage space though. If this is the primary key column you can't change the datatype unless you first drop the primary key. Assuming you name your primary key constraint this is fairly simple. You don't have to create a new column and do all that hocus pocus nonsense like in the comment from lad2025.
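-- Test table with a named primary key constraint on an int identity column
create table IntTest
(
MyID int identity
, SomveValue uniqueidentifier
, constraint IntTest_PK primary key clustered (MyID)
)
-- Populate it with one row per entry in sys.all_columns
insert IntTest
select NEWID()
from sys.all_columns
-- Drop the key, widen the column, then recreate the key
alter table IntTest drop constraint IntTest_PK
alter table IntTest alter column MyID bigint not null
alter table IntTest add constraint IntTest_PK primary key clustered (MyID)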

Are 'Primary Keys' obligatory in SQL Server Design?

Observe the following table model:
CREATE TABLE [site].[Permissions] (
[ID] INT REFERENCES [site].[Accounts]( [ID] ) NOT NULL,
[Type] SMALLINT NOT NULL,
[Value] INT NULL
);
The site.Accounts->site.Permissions is a one-to-many relationship so 'ID' cannot be made a primary key due to the uniqueness that a PK imposes.
The rows are selected using a WHERE [ID] = ? clause, so adding a phoney IDENTITY column and making it the PK yields no benefit at the cost of additional disk space.
It's my understanding that the targeted platform - SQL Server (2008) - does not support composite PKs. These all add up to my question: if a primary key is not used, is something wrong? Or could something be more right?
Your understanding is not correct, SQL Server does support composite primary keys!
The syntax to add one would be
ALTER TABLE [site].[Permissions]
ADD CONSTRAINT PK_Permissions PRIMARY KEY CLUSTERED (id,[Type])
Regarding the question in the comments "What is the benefit of placing a PK on the entire table?"
I'm not sure from your description though what the PK would need to be on. Is it all 3 columns or just 2 of them? If it's on id,[Type] then presumably you wouldn't want the possibility that the same id,[Type] combo could appear multiple times with conflicting values.
If it is on all 3 columns then to turn the question around why wouldn't you want a primary key?
If you are going to have a clustered index on your table you could just make that the primary key. If say you made a clustered index on the id column only SQL Server would add in uniqueifiers anyway to make it unique and your columns are so narrow (int,smallint,int) this just seems a pointless addition.
Additionally the query optimiser can use unique constraints to improve its query plans (though might not apply if the only queries on that table really are WHERE [ID] = ?) and it would be pretty wasteful to allow duplicates that you then have to both store and filter out with DISTINCT.
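For illustration, the same key could be declared when the table is created. This sketch assumes that the [site] schema and [site].[Accounts] table already exist and that each (ID, Type) pair should appear only once, which the question does not confirm.

```sql
-- Sketch only: assumes [site].[Accounts] exists and (ID, Type) is meant to be unique.
CREATE TABLE [site].[Permissions] (
    [ID]    INT      NOT NULL REFERENCES [site].[Accounts] ([ID]),
    [Type]  SMALLINT NOT NULL,
    [Value] INT      NULL,
    CONSTRAINT PK_Permissions PRIMARY KEY CLUSTERED ([ID], [Type])
);
```

Because [ID] is the leading key column, queries of the form WHERE [ID] = ? can still seek on the clustered index, and a second row with the same ID and Type is rejected instead of being stored.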

SQL Server: Clustering by timestamp; pros/cons

I have a table in SQL Server where I want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.
This could be achieved by clustering on a datetime column:
CREATE TABLE Things (
...
CreatedDate datetime DEFAULT getdate(),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)
But I can't guarantee that two Things won't have the same time. So my requirements can't really be achieved by a datetime column.
I could add a dummy identity int column, and cluster on that:
CREATE TABLE Things (
...
RowID int IDENTITY(1,1),
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)
But you'll notice that my table already contains a timestamp column; a column which is guaranteed to be monotonically increasing. This is exactly the characteristic I want for a candidate cluster key.
So I cluster the table on the rowversion (aka timestamp) column:
CREATE TABLE Things (
...
[timestamp] timestamp,
CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
What I'm looking for are thoughts of why this is a bad idea; and what other ideas are better.
Note: Community wiki, since the answers are subjective.
So I cluster the table on the rowversion (aka timestamp) column:
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
That might sound like a good idea at first - but it's really almost the worst option you have. Why?
The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):
stable
narrow
unique
ever-increasing if possible
Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.
In the end, adding a dummy identity column probably is a much better alternative for your case. The second best choice would be the datetime column - but here, you do run the risk of SQL Server having to add "uniqueifiers" to your entries when duplicates occur - and with a 3.33ms accuracy, this could definitely be happening - not optimal, but definitely much better than the rowversion idea...
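A minimal sketch of the identity-based alternative. The Name column stands in for the columns elided in the question, and RowVer is a hypothetical name for the rowversion column; both are assumptions made only so the example runs.

```sql
-- Sketch only: Name is a placeholder for the elided columns; RowVer is a hypothetical name.
CREATE TABLE Things (
    RowID       int IDENTITY(1,1) NOT NULL,
    Name        nvarchar(100) NOT NULL,
    CreatedDate datetime NOT NULL DEFAULT GETDATE(),
    RowVer      rowversion NOT NULL,
    CONSTRAINT IX_Things UNIQUE CLUSTERED (RowID)
);
GO

INSERT INTO Things (Name) VALUES (N'first');

-- The update changes RowVer, but RowID (the clustering key) stays the same,
-- so nonclustered indexes do not need their row locators rewritten.
UPDATE Things SET Name = N'renamed' WHERE RowID = 1;
```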
From the timestamp documentation linked in the question:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
and
Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.
So why on earth would you want to cluster by either, especially since their values always change when the row is updated? Just use an identity as the PK and cluster on it.
You were on the right track already. You can use a datetime column that holds the created date and create a clustered but non-unique index on it.
CREATE TABLE Things (
...
CreatedDate datetime DEFAULT getdate(),
[timestamp] timestamp,
)
CREATE CLUSTERED INDEX [IX_CreatedDate] ON [Things]
(
[CreatedDate] ASC
)
If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.
