SQL Server: Clustering by timestamp; pros/cons - sql-server

I have a table in SQL Server where I want inserts to be added to the end of the table (as opposed to a clustering key that would cause them to be inserted in the middle). This means I want the table clustered by some column that will constantly increase.
This could be achieved by clustering on a datetime column:
CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp,
    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (CreatedDate)
)
But I can't guarantee that two Things won't have the same time, so my requirement can't really be met by a datetime column.
I could add a dummy identity int column, and cluster on that:
CREATE TABLE Things (
    ...
    RowID int IDENTITY(1,1),
    [timestamp] timestamp,
    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (RowID)
)
But you'll notice that my table already contains a timestamp column; a column which is guaranteed to be monotonically increasing. This is exactly the characteristic I want in a candidate cluster key.
So I cluster the table on the rowversion (aka timestamp) column:
CREATE TABLE Things (
    ...
    [timestamp] timestamp,
    CONSTRAINT [IX_Things] UNIQUE CLUSTERED (timestamp)
)
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
What I'm looking for are thoughts on why this is a bad idea, and what other ideas are better.
Note: Community wiki, since the answers are subjective.

So I cluster the table on the rowversion (aka timestamp) column:
Rather than adding a dummy identity int column (RowID) to ensure an order, I use what I already have.
That might sound like a good idea at first - but it's really almost the worst option you have. Why?
The main requirements for a clustered key are (see Kim Tripp's blog post for more excellent details):
stable
narrow
unique
ever-increasing if possible
Your rowversion violates the stable requirement, and that's probably the most important one. The rowversion of a row changes with each modification to the row - and since your clustering key is being added to each and every non-clustered index in the table, your server will be constantly updating loads of non-clustered indices and wasting a lot of time doing so.
In the end, adding a dummy identity column probably is a much better alternative for your case. The second best choice would be the datetime column - but here, you do run the risk of SQL Server having to add "uniqueifiers" to your entries when duplicates occur - and with a 3.33ms accuracy, this could definitely be happening - not optimal, but definitely much better than the rowversion idea...
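For what it's worth, a minimal sketch of that identity-based alternative (names such as PK_Things, IX_Things_CreatedDate and the Name payload column are hypothetical): the identity column is the clustering key, the rowversion stays purely for optimistic-concurrency checks, and the created date remains available for querying.
CREATE TABLE dbo.Things (
    RowID       int IDENTITY(1,1) NOT NULL,
    Name        nvarchar(100) NOT NULL,        -- placeholder for the real payload columns
    CreatedDate datetime NOT NULL DEFAULT getdate(),
    [timestamp] timestamp,
    CONSTRAINT PK_Things PRIMARY KEY CLUSTERED (RowID)
)
-- optional: a narrow non-clustered index if you still query by creation date
CREATE NONCLUSTERED INDEX IX_Things_CreatedDate ON dbo.Things (CreatedDate)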

From the documentation linked for timestamp in the question:
The timestamp syntax is deprecated. This feature will be removed in a future version of Microsoft SQL Server. Avoid using this feature in new development work, and plan to modify applications that currently use this feature.
and
Duplicate rowversion values can be generated by using the SELECT INTO statement in which a rowversion column is in the SELECT list. We do not recommend using rowversion in this manner.
So why on earth would you want to cluster by either, especially since their values always change when the row is updated? Just use an identity column as the PK and cluster on it.

You were on the right track already. You can use a DateTime column that holds the created date and create a clustered, but non-unique, index on it.
CREATE TABLE Things (
    ...
    CreatedDate datetime DEFAULT getdate(),
    [timestamp] timestamp
)
CREATE CLUSTERED INDEX [IX_CreatedDate] ON [dbo].[Things]
(
    [CreatedDate] ASC
)

If this table gets a lot of inserts, you might be creating a hot spot that interferes with updates, because all of the inserts will be happening on the same physical/index pages. Check your locking setup.
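One way to check for that kind of last-page contention (a sketch; it assumes the Things table above and SQL Server 2005 or later) is to look at page-latch waits per index via sys.dm_db_index_operational_stats:
SELECT  i.name AS index_name,
        os.leaf_insert_count,
        os.page_latch_wait_count,
        os.page_latch_wait_in_ms
FROM    sys.dm_db_index_operational_stats(DB_ID(), OBJECT_ID('dbo.Things'), NULL, NULL) AS os
JOIN    sys.indexes AS i
        ON i.object_id = os.object_id AND i.index_id = os.index_id
High page-latch waits concentrated on the clustered index are a hint that concurrent inserts are all fighting over the same trailing pages.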

Related

SQL Server - smart approach for combining GUID and identity column

I am trying to come up with a design for my database where across all my tables I'd like to have the combination of a GUID column (uniqueidentifier data type) and an identity column (int data type).
The GUID column is going to be a NONCLUSTERED index whilst the identity column is going to be the CLUSTERED index. I was wondering if the script below is a correct/safe approach when it comes to database design:
CREATE TABLE country
(
    guid        uniqueidentifier DEFAULT NEWID() NOT NULL,
    code        int IDENTITY(1, 1) NOT NULL,
    isoCode     nvarchar(5) NOT NULL,
    description nvarchar(255) NOT NULL,
    created     date NOT NULL DEFAULT GETDATE(),
    updated     date NOT NULL DEFAULT GETDATE(),
    inactive    bit DEFAULT 0,
    CONSTRAINT NIX_guid PRIMARY KEY NONCLUSTERED (guid),
    CONSTRAINT AK_code UNIQUE (code),
    CONSTRAINT AK_isoCode UNIQUE (isoCode)
)
GO
CREATE UNIQUE CLUSTERED INDEX [IX_code] ON country ([code] ASC)
GO
Any tips would be much appreciated!
The domain of all possible countries is never going to be more than a few hundred, so performance should not be a concern.
You already have an isoCode. That is a canonically defined candidate key. I understand what you mean when talking about GUIDs being useful because they can never collide when created on separate servers/application instances/etc. But ISO country codes can never collide either, because they're already defined by an external authority. You don't need the GUID.
Why is your existing isoCode column an nvarchar(5)? There are 2-letter and 3-letter ISO 3166 standards. No Unicode characters are required, so you can use char(2) or char(3), depending on which standard you pick; both are narrower than a 4-byte int.
Yes, an identity-based clustered index does mean not having to worry about page splits on insert. But these are countries. We already know all of the countries you need to insert right now, and are you really worried that the handful of changes that might be made over the next few decades will kill the performance of your system due to page splits on insert? No, so you don't need the identity column either.
Eliminate both surrogates, and just go with the alpha-2 or alpha-3 ISO country code as your clustered primary key.
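A sketch of what that stripped-down table could look like, assuming the alpha-2 standard (the constraint name is illustrative):
CREATE TABLE country
(
    isoCode     char(2)       NOT NULL,
    description nvarchar(255) NOT NULL,
    created     date          NOT NULL DEFAULT GETDATE(),
    updated     date          NOT NULL DEFAULT GETDATE(),
    inactive    bit           NOT NULL DEFAULT 0,
    CONSTRAINT PK_country PRIMARY KEY CLUSTERED (isoCode)
)
GO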

The row-limitation in compound primary key in SQL Server 2014

I am going to insert 2.3 billion rows (2,300,000,000) from table_a into table_b. The schemas of table_a and table_b are identical; the only difference is that table_a doesn't have a primary key, while table_b (currently empty) has a 4-column compound primary key. I encountered this error message after 24 hours:
Msg 666, Level 16, State 2, Line 1
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 422223771074560. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.
This is my compound PK in table_b, along with the sample query; any help would be appreciated.
column1: varchar(10), not null
column2: nvarchar(50), not null
column3: nvarchar(100), not null
column4: int, not null
Sample code
insert into table_b
select *
from table_a
where date < '2017-01-01' -- some filters here
According to the SQL Server documentation, part of creating a primary key includes creating a unique index on that same table.
When you create a PRIMARY KEY constraint, a unique index on the column, or columns, is automatically created. By default, this index is clustered; however, you can specify a nonclustered index when you create the constraint.
When the clustered index is not unique, rows get what the docs call a "uniqueifier", which is 4 bytes in length (i.e. ~2.14 billion possible values).
If the clustered index is not created with the UNIQUE property, the Database Engine automatically adds a 4-byte uniqueifier column to the table. When it is required, the Database Engine automatically adds a uniqueifier value to a row to make each key unique. This column and its values are used internally and cannot be seen or accessed by users.
From this information and your error message we can tell two things:
There is a clustered index on the table
There is not a primary key on the table
Given the volume of data you're dealing with, I'm betting you have a Clustered Columnstore Index on the table, which in SQL Server 2014 cannot have a primary key defined on it.
One possible solution is to partition table_b on a particular column (one with fewer than 15,000 unique values, per the limits specified in the documentation). As a side note, the same partitioning effort could significantly reduce the run time of queries against table_b, depending on which column is used in the partition function.
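A rough sketch of the partitioning mechanics (the names and boundary values are made up; here I assume column4 holds something like a year and that the table is clustered with a rowstore index rather than a columnstore):
CREATE PARTITION FUNCTION pf_table_b (int)
    AS RANGE RIGHT FOR VALUES (2014, 2015, 2016, 2017)
GO
CREATE PARTITION SCHEME ps_table_b
    AS PARTITION pf_table_b ALL TO ([PRIMARY])
GO
-- the partitioning column has to be part of the clustered index key
CREATE CLUSTERED INDEX CIX_table_b
    ON table_b (column4)
    ON ps_table_b (column4)
GO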
You know that:
If the clustered index is not created with the UNIQUE property, the Database Engine automatically adds a 4-byte uniqueifier column to the table. When it is required, the Database Engine automatically adds a uniqueifier value to a row to make each key unique. This column and its values are used internally and cannot be seen or accessed by users.
While it's unlikely that you will face an issue related to uniqueifiers, we have seen rare cases where a customer reaches the uniqueifier limit of 2,147,483,648, generating error 666.
And from this topic about the issue we have:
As of February 2018, the design goal for the storage engine is to not reset uniqueifiers during REBUILDs. As such, a rebuild of the index ideally would not reset uniqueifiers, and the issue would continue to occur while inserting new data with a key value for which the uniqueifiers were exhausted. But current engine behavior is different for one specific case: if you use the statement ALTER INDEX ALL ON <table> REBUILD WITH (ONLINE = ON), it will reset the uniqueifiers (across all versions from SQL Server 2005 to SQL Server 2017).
So, if this is the cause of your issue, you can add an additional integer column and build the clustered index over it, as sketched below.
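A sketch of that workaround (the column and index names are hypothetical; the existing clustered index would have to be dropped or rebuilt first, and adding a column to a 2.3-billion-row table is itself a heavy operation):
ALTER TABLE table_b ADD row_id bigint IDENTITY(1,1) NOT NULL
GO
-- appending the identity column makes the clustered key genuinely unique,
-- so no uniqueifier is ever needed
CREATE UNIQUE CLUSTERED INDEX CIX_table_b
    ON table_b (column1, column2, column3, column4, row_id)
GO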

designing new table for daily uploads - use unique constraint

I am using SQL Server 2012 and am creating a table that will have 8 columns, with the types below:
datetime
varchar(12)
varchar(6)
varchar(100)
float
float
int
datetime
Once a day (normally) there will be an upload of approx 10,000 rows of data. Going forward it's possible it could be 100,000.
The rows will be unique if I group on the first three columns listed above. I have read I can use the unique constraint on multiple columns which will guarantee the rows are unique.
I think I'm correct in saying that a unique constraint by default sets up a non-clustered index. Would a clustered index be better, and can I assume this won't cause any issues once the table contains millions of rows?
My last question: by applying the unique constraint to my table, am I right to say that querying the data will be quicker than if the constraint weren't applied (because of the underlying non-clustered or clustered index), and that uploading the data will be slower (which is fine) with the constraint on the table?
A unique index can be non-clustered.
A primary key is unique and can be clustered.
A clustered index is not unique by default.
A unique clustered index is unique :)
More information is available in this guide.
So, we should separate uniqueness and index keys.
If you need to keep data unique by some columns, create a unique constraint (unique index). You'll protect your data.
Also, you can create a primary key (PK) on your columns; they will be unique as well. But there is a difference: all other indexes will reference the PK (when it is the clustered key), so the PK must be as short as possible. So, my advice: create an identity column (int or bigint) and create the PK on it. And create a unique index on your unique columns.
Querying data may become faster if you query on your unique columns; if you query on other columns, you need to create other, specific indexes.
So, unique keys are for data consistency, and indexes are for queries.
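Put together, that advice might look roughly like this (a sketch; the table and column names are placeholders matching the column types listed in the question):
CREATE TABLE dbo.DailyUpload
(
    UploadID    int IDENTITY(1,1) NOT NULL,
    ReadingDate datetime     NOT NULL,
    Code        varchar(12)  NOT NULL,
    SubCode     varchar(6)   NOT NULL,
    Description varchar(100) NULL,
    Value1      float        NULL,
    Value2      float        NULL,
    Quantity    int          NULL,
    LoadedAt    datetime     NOT NULL DEFAULT GETDATE(),
    CONSTRAINT PK_DailyUpload PRIMARY KEY CLUSTERED (UploadID),
    CONSTRAINT UQ_DailyUpload UNIQUE NONCLUSTERED (ReadingDate, Code, SubCode)
)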
I think I'm correct in saying that a unique constraint by default sets up a non-clustered index
TRUE
Would a clustered index be better, and can I assume this won't cause any issues once the table contains millions of rows?
(1) If you need to make (datetime, varchar(12), varchar(6)) unique, and
(2) if your application will access rows using datetime, or datetime + varchar(12), or datetime + varchar(12) + varchar(6) in the WHERE condition all the time,
then put the primary key on (datetime, varchar(12), varchar(6)); by default it will enforce uniqueness and put the clustered index on all three columns.
But as you commented above:
the queries will vary to be honest. I imagine most queries will make use of the first datetime column
and since you will be dealing with huge data and might join this table with other tables, it is better to have a surrogate key (an ever-increasing unique identifier) in the table, and to satisfy your SELECTs, have non-clustered indexes.
Surrogate Key vs Business Key
NON-CLUSTERED INDEX
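In code, that might look something like the following sketch (reusing the hypothetical DailyUpload table from the answer above): a surrogate identity as the clustered PK, plus a non-clustered index keyed on the datetime column most queries filter on.
-- the surrogate key is already the clustered PK (PK_DailyUpload, see the sketch above);
-- cover the common datetime-first lookups with a non-clustered index
CREATE NONCLUSTERED INDEX IX_DailyUpload_ReadingDate
    ON dbo.DailyUpload (ReadingDate, Code, SubCode)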

Are 'Primary Keys' obligatory in SQL Server Design?

Observe the following table model:
CREATE TABLE [site].[Permissions] (
[ID] INT REFERENCES [site].[Accounts]( [ID] ) NOT NULL,
[Type] SMALLINT NOT NULL,
[Value] INT NULL
);
The site.Accounts -> site.Permissions relationship is one-to-many, so 'ID' cannot be made a primary key due to the uniqueness that a PK imposes.
The rows are selected using a WHERE [ID] = ? clause, so adding a phoney IDENTITY column and making it the PK yields no benefit at the cost of additional disk space.
It's my understanding that the targeted platform, SQL Server (2008), does not support composite PKs. This all adds up to my question: if a primary key is not used, is something wrong? Or could something be more right?
Your understanding is not correct, SQL Server does support composite primary keys!
The syntax to add one would be
ALTER TABLE [site].[Permissions]
ADD CONSTRAINT PK_Permissions PRIMARY KEY CLUSTERED (id,[Type])
Regarding the question in the comments "What is the benefit of placing a PK on the entire table?"
I'm not sure from your description though what the PK would need to be on. Is it all 3 columns or just 2 of them? If it's on id,[Type] then presumably you wouldn't want the possibility that the same id,[Type] combo could appear multiple times with conflicting values.
If it is on all 3 columns then to turn the question around why wouldn't you want a primary key?
If you are going to have a clustered index on your table, you could just make that the primary key. If, say, you made a clustered index on the id column only, SQL Server would add in uniqueifiers anyway to make it unique, and since your columns are so narrow (int, smallint, int) this just seems a pointless addition.
Additionally the query optimiser can use unique constraints to improve its query plans (though might not apply if the only queries on that table really are WHERE [ID] = ?) and it would be pretty wasteful to allow duplicates that you then have to both store and filter out with DISTINCT.

Creating a Primary Key on a temp table - When?

I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
   ,Col2 INT NOT NULL
   ,Col3 BIGINT
   ,Col4 VARCHAR(25) NOT NULL
    --Etc...
    --
    --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types I used are real. In the #TempTable table, Col1 and Col4 will be making up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT .. INTO is a bulk operation and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED keyword is optional; primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
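If you do want to measure it, a simple sketch of a harness (assuming a source table MyTable with the Col1/Col4 key columns from the question; results will vary with your data and hardware) is to time both orderings with SET STATISTICS TIME:
SET STATISTICS TIME ON

-- variant 1: create the key first, then load
CREATE TABLE #T1 (
    Col1 NUMERIC(18,0) NOT NULL,
    Col4 VARCHAR(25) NOT NULL,
    CONSTRAINT PK_T1 PRIMARY KEY (Col1, Col4)
)
INSERT INTO #T1 (Col1, Col4)
SELECT Col1, Col4 FROM MyTable

-- variant 2: load first, then add the key
CREATE TABLE #T2 (
    Col1 NUMERIC(18,0) NOT NULL,
    Col4 VARCHAR(25) NOT NULL
)
INSERT INTO #T2 (Col1, Col4)
SELECT Col1, Col4 FROM MyTable
ALTER TABLE #T2 ADD CONSTRAINT PK_T2 PRIMARY KEY (Col1, Col4)

SET STATISTICS TIME OFF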
I was wondering if I could improve a very very "expensive" stored procedure entailing a bunch of checks at each insert across tables and came across this answer. In the Sproc, several temp tables are opened and reference each other. I added the Primary Key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS statements to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add PK on table creation - the insert check is O(Tn) (where Tn is "n-th triangular number", which is 1 + 2 + 3 ... + n) because when you insert x-th row, it's checked against previously inserted "x - 1" rows
When you add PK after inserting all the values - the checker is O(n^2) because when you insert x-th row, it's checked against all n existing rows.
First one is obviously faster since O(Tn) is less than O(n^2)
P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations
