I have this table variable I use in my SP:
DECLARE @t TABLE (ID uniqueidentifier)
Then I insert some data into it (which I use later):
INSERT INTO @t (ID)
SELECT ID FROM Categories WHERE ...
And later I have a few SELECT and UPDATE statements based on the IDs in @t, e.g.:
SELECT * FROM Categories A INNER JOIN @t T ON A.ID = T.ID
etc..
Should I declare ID uniqueidentifier PRIMARY KEY to improve performance of the SELECT / UPDATE statements?
If yes, should it be clustered or nonclustered?
What is the advised option in my case?
EDIT: All the tables in my DB have a uniqueidentifier (ID) column as a NONCLUSTERED primary key.
EDIT 2: Strangely (or not), when I use PRIMARY KEY NONCLUSTERED on the table variable, the execution plan for the joined SELECT shows a Table Scan on @t, but when I omit NONCLUSTERED it shows a Clustered Index Scan.
If you are worried about performance, you should probably not be using a table variable, but a temporary table instead. The problem with table variables is that statements referencing them are compiled when the table is empty, and therefore the query optimiser always assumes there is only one row. This can result in suboptimal performance when the table variable is populated with many more rows.
Regarding the primary key, there are downsides to making it clustered, as it will result in the table being physically ordered by the index. The overhead of this sorting may outweigh the performance benefit when querying the data. In general it is better to add a non-clustered index; however, as always, it depends on your particular problem and you will have to test the different implementations.
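For illustration, here is a minimal sketch of the temp-table alternative, keeping the same shape as the question (the WHERE clause is elided exactly as in the question, and the primary key is nonclustered per the advice above):

CREATE TABLE #t (ID uniqueidentifier NOT NULL PRIMARY KEY NONCLUSTERED);

INSERT INTO #t (ID)
SELECT ID FROM Categories WHERE ...;

SELECT A.*
FROM Categories A
INNER JOIN #t T ON A.ID = T.ID;

DROP TABLE #t;

Because #t is a real temporary table, it gets statistics, so the optimizer can base the join plan on the actual number of rows instead of a one-row guess.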
I am seeing an odd behavior in SQL Server that doesn't make any sense to me. I have a PERSISTED computed column that is stored in a covering index. However, when this computed column in the covering index needs to be referenced in order to update another column in another table, the optimizer chooses not to use it at all and instead does a Key Lookup to get the values from the clustered index. Why?
The UPDATE statement at the very end of this demo script works as expected, but if you look closely at its execution plan, the update does NOT use the covering index IX_MyTable_VarcharValue1_ComputedColumn. Instead, it does a Key Lookup and goes back to the clustered index to get VarcharValue2, even though the ComputedColumn that it needs for the update is literally already there! In my mind PERSISTED means persisted to disk. So why isn't it using the value when it looks at the non-clustered index the first time to get VarcharValue1? Isn't doing the Key Lookup extra work?
CREATE TABLE dbo.MyTable (
  [ID] INT NOT NULL
  , [VarcharValue1] VARCHAR(50) NOT NULL
  , [NotComputedColumn] VARCHAR(50) NULL
  , CONSTRAINT [PK_MyTable]
      PRIMARY KEY CLUSTERED (ID ASC)
) ON [PRIMARY];
CREATE NONCLUSTERED INDEX IX_MyTable_VarcharValue1
ON dbo.MyTable ([VarcharValue1] ASC);
CREATE TABLE dbo.ComputedColumnTable (
  [ID] INT NOT NULL
  , [VarcharValue1] VARCHAR(50) NOT NULL
  , [VarcharValue2] VARCHAR(50) NOT NULL
  , [ComputedColumn] AS [VarcharValue1] + [VarcharValue2] PERSISTED NOT NULL
  , CONSTRAINT [PK_ComputedColumnTable]
      PRIMARY KEY CLUSTERED (ID ASC)
) ON [PRIMARY];
CREATE NONCLUSTERED INDEX IX_MyTable_VarcharValue1_ComputedColumn
ON dbo.ComputedColumnTable ([VarcharValue1] ASC, [ComputedColumn] ASC);
INSERT INTO dbo.MyTable VALUES(1,'e',NULL)
INSERT INTO dbo.MyTable VALUES(2,'d',NULL)
INSERT INTO dbo.MyTable VALUES(3,'c',NULL)
INSERT INTO dbo.MyTable VALUES(4,'b',NULL)
INSERT INTO dbo.MyTable VALUES(5,'a',NULL)
INSERT INTO dbo.ComputedColumnTable VALUES(1,'a','b')
INSERT INTO dbo.ComputedColumnTable VALUES(2,'b','c')
INSERT INTO dbo.ComputedColumnTable VALUES(3,'c','d')
INSERT INTO dbo.ComputedColumnTable VALUES(4,'d','e')
INSERT INTO dbo.ComputedColumnTable VALUES(5,'e','f')
SELECT * FROM dbo.MyTable
SELECT * FROM dbo.ComputedColumnTable
-- uses a Key Lookup to get VarcharValue2 instead of the ComputedColumn in the covering index
UPDATE m
SET m.NotComputedColumn = c.ComputedColumn
FROM MyTable m
JOIN ComputedColumnTable c
ON m.VarcharValue1 = c.VarcharValue1
Edit: Adding link for Execution Plan: https://www.brentozar.com/pastetheplan/?id=Hkk_MZ8JK
The Clustered Index Update operator modifies the clustered index pages, so the optimizer decided to expand the definition of the persisted computed column.
"It can come as quite a shock to see SQL Server recomputing the underlying expression each time while ignoring the deliberately-provided stored value" (Properly Persisted Computed Columns)
In this case, the optimizer needs VarcharValue2 in order to recalculate the expression in a Compute Scalar operator. If we look at that Compute Scalar operator, it re-calculates the persisted computed column value; this can be seen in its ScalarString attribute.
On the other hand, we can avoid the Key Lookup operation by disabling the GbAggToStrm rule, which eliminates the Stream Aggregate, but then the optimizer decides to perform a Clustered Index Scan on ComputedColumnTable instead: it still requires the VarcharValue2 column, only the data access method changes.
-- Don't use these options in a production database
DBCC TRACEON (3604);
DBCC RULEOFF('GbAggToStrm')
GO
UPDATE m
SET m.NotComputedColumn = c.ComputedColumn
FROM MyTable m
JOIN ComputedColumnTable c
ON m.VarcharValue1 = c.VarcharValue1
DBCC RULEON('GbAggToStrm')
As a result, the VarcharValue2 column requirement does not change.
What we can do: to resolve this situation, we can add VarcharValue2 to the IX_MyTable_VarcharValue1_ComputedColumn index definition. That way, we can eliminate the Key Lookup operation.
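For example (a sketch based on the index name from the demo script, not a statement taken from the original answer), the existing index could be rebuilt with VarcharValue2 as an included column:

-- Rebuild the covering index so it also carries VarcharValue2;
-- once every referenced column is present, the Key Lookup should disappear.
CREATE NONCLUSTERED INDEX IX_MyTable_VarcharValue1_ComputedColumn
    ON dbo.ComputedColumnTable ([VarcharValue1] ASC, [ComputedColumn] ASC)
    INCLUDE ([VarcharValue2])
    WITH (DROP_EXISTING = ON);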
I have a table with about 60,000,000 rows and a size of nearly 60 GB.
It has two indexes: a clustered index and a primary key, both on the same id identity column.
The primary key index is nearly 1 GB in size, which looks excessive. I have several such tables.
The question is: is there any way to effectively mark the existing clustered index as the primary key as well, without dropping both indexes and creating a single new one?
A sub-question: is it even worth doing such an operation, or should I perhaps only drop the primary keys? What is the real practical advantage of having a primary key, apart from its descriptive use for third-party tools? Maybe the SQL Server optimizer uses this metadata to optimize queries, or there are other advantages I am missing?
Below is a small sample of what I want to achieve, but done the other way (if another way exists), without dropping and creating indexes.
It looks like there is really no other way, but who knows, maybe there is some trick.
create table a
(
id int identity,
col1 varchar(50)
)
create unique clustered index cix_id on a(id)
alter table a add constraint pk_a primary key nonclustered(id)
select t.name,i.name,i.is_primary_key,i.type_desc from
sys.tables t
inner join sys.indexes i on i.object_id=t.object_id
where t.name='a'
drop index cix_id on a
alter table a drop constraint pk_a
alter table a add constraint pk_a primary key clustered(id)
select t.name,i.name,i.is_primary_key,i.type_desc from
sys.tables t
inner join sys.indexes i on i.object_id=t.object_id
where t.name='a'
I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable and RelatedTable, and the current table structure is as follows:
MainTable [40 million rows]
IDGuid - uniqueidentifier - PK
-- [data columns]
RelatedTable [400 million rows]
RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]
The clustered index for MainTable is currently the primary key (IDGuid). The clustered index for RelatedTable is currently (MainTableIDGuid, SequenceNumber).
I want my conversion to do several things:
Change MainTable to use an integer ID instead of GUID
Add a MainTableIDInt column to related table that links to Main Table's integer ID
Change the primary key and clustered index of RelatedTable to (MainTableIDInt, SequenceNumber)
Get rid of the guid columns.
I've written a script to do the following:
Add an IDInt int IDENTITY column to MainTable. This does a table rebuild and generates the new identity ID values.
Add a MainTableIDInt int column to RelatedTable.
The next step is to populate the RelatedTable.MainTableIDInt column for each row with its corresponding MainTable.IDInt value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.
I can write a SQL statement that does this update:
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)
or
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid
The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:
Clustered index scans over MainTable and RelatedTable and does a Merge Join on them [estimated number of rows = 400 million]
Sorts [estimated number of rows = 400 million]
Clustered index update over RelatedTable [estimated number of rows = 400 million]
I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about performance of these execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?
First, this will be a headache. Second, I wouldn't change any of the indexes or constraints until I had the data in place. I.e., I would add the identity column but not make it the primary key nor clustered index. Then I'd add the soon-to-be new foreign keys to the various tables. Your queries should look like:
Update ChildTable
Set NewIntForeignKeyId = P.NewIntPrimaryKey
From ChildTable As C
Join ParentTable As P
On P.PrimaryKey = C.ForeignKey
First, notice that I'm using an inner join. There is no reason to use an outer join for this type of query given that you will eventually enforce referential integrity between the new columns. Second, if you populate the columns first and then rebuild the constraints, it will be faster as you'll be able to leverage the existing indexes. Remember that when you change the clustered index, it rebuilds all of the nonclustered indexes. If the tables are large, that will be a serious hit.
Once you have the data in place, I'd then drop all primary constraints, unique constraints, foreign key constraints and unique indexes. Drop the clustered index/constraint last. I'd then add the clustered indexes to all of the tables and after that was done, recreate the unique constraints, foreign key constraints and indexes. If you do not drop the existing indexes before you recreate the clustered index, it will rebuild the existing indexes twice: once when you drop the clustered index and again when you recreate it.
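A rough sketch of that ordering for the question's MainTable / RelatedTable, with hypothetical constraint and index names (your real names and the full set of dependent objects will differ):

-- 1) Drop foreign keys, unique constraints and nonclustered indexes that
--    depend on the old keys (names here are made up for illustration)
ALTER TABLE RelatedTable DROP CONSTRAINT FK_RelatedTable_MainTable;
DROP INDEX IX_RelatedTable_SomeColumn ON RelatedTable;

-- 2) Drop the old clustered primary key last
ALTER TABLE MainTable DROP CONSTRAINT PK_MainTable;

-- 3) Create the new clustered primary key on the integer column
ALTER TABLE MainTable ADD CONSTRAINT PK_MainTable
    PRIMARY KEY CLUSTERED (IDInt);

-- 4) Only then recreate the foreign keys and the remaining indexes
ALTER TABLE RelatedTable ADD CONSTRAINT FK_RelatedTable_MainTable
    FOREIGN KEY (MainTableIDInt) REFERENCES MainTable (IDInt);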
Btw, I highly doubt there is a way to avoid table scans for this sort of thing since you are going to be updating every row.
Is the following possible? I am unable to do it. Do I have to have a permanent table to create an index?
declare @Beatles table
(
LastName varchar(20) ,
FirstName varchar(20)
)
CREATE CLUSTERED INDEX Index_Name_Clstd ON @Beatles(LastName)
Not on a table variable, but you can on a temp table; see http://www.sqlteam.com/article/optimizing-performance-indexes-on-temp-tables
No, you cannot create indexes on a table variable - see this article and this posting comparing local and global temporary tables to table variables.
Restrictions
You cannot create a non-clustered index on a table variable, unless the index is a side effect of a PRIMARY KEY or UNIQUE constraint on the table (SQL Server enforces any UNIQUE or PRIMARY KEY constraints using an index).
According to this post - YES you can.
The following declaration will generate 2 indexes:
DECLARE #Users TABLE
(
UserID INT PRIMARY KEY,
UserName VARCHAR(50),
FirstName VARCHAR(50),
UNIQUE (UserName,UserID)
)
The first index will be clustered, and will include the primary key.
The second index will be non-clustered and will include the columns listed in the unique constraint.
Here is another post showing how to force the query optimizer to use these dynamically generated indexes, because it will tend to ignore them (the indexes are created after the execution plan has been evaluated).
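I don't know exactly which hint that post recommends, but one common technique (a sketch, using the @Users variable above and run in the same batch as its declaration) is OPTION (RECOMPILE), which defers compilation of the statement until the table variable is populated, so the optimizer can account for the constraint-backed indexes and the actual row count:

SELECT u.UserID, u.UserName
FROM @Users AS u
WHERE u.UserName = 'someone'
OPTION (RECOMPILE);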
I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE TABLE #TempTable (
  Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
  ,Col2 INT NOT NULL
  ,Col3 BIGINT
  ,Col4 VARCHAR(25) NOT NULL
  --Etc...
  --
  --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types shown are the actual ones. In the #TempTable table, Col1 and Col4 will make up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
This aside, though, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation: presumably there is a best index for reading the data from it in the next step, and that index will need to either already be in place or be created. This is where you have to trade speed now for reliability (apply the PK and any other constraints first) against speed later (have at least the clustered index in place if you are going to have one).
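As a sketch for the temp table in the question (Col1 and Col4 forming the key), the load-as-a-heap-then-index approach would end with something like:

-- after the INSERT ... SELECT statements have populated the heap:
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (Col1, Col4);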
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT ... INTO is a bulk operation, and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- the CLUSTERED keyword is optional; primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
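As a sketch of the create-it-first option, folded into the question's table definition (Col1 and Col4 as the composite key):

CREATE TABLE #TempTable (
  Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
  ,Col2 INT NOT NULL
  ,Col3 BIGINT
  ,Col4 VARCHAR(25) NOT NULL
  ,PRIMARY KEY (Col1, Col4) -- enforced on every insert, so duplicates fail immediately
)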
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe SQL Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
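If you do want to measure it, a throwaway benchmark along these lines (the table names and the synthetic 1,000,000-row load are made up purely for the test) reports the CPU and elapsed time of each approach:

SET NOCOUNT ON;
SET STATISTICS TIME ON;

-- Variant A: primary key declared up front
CREATE TABLE #WithPk (Col1 INT NOT NULL PRIMARY KEY);
INSERT INTO #WithPk (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a CROSS JOIN sys.all_objects b;

-- Variant B: load the heap first, then add the primary key
CREATE TABLE #NoPk (Col1 INT NOT NULL);
INSERT INTO #NoPk (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects a CROSS JOIN sys.all_objects b;
ALTER TABLE #NoPk ADD PRIMARY KEY (Col1);

SET STATISTICS TIME OFF;
DROP TABLE #WithPk, #NoPk;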
I was wondering if I could improve a very very "expensive" stored procedure entailing a bunch of checks at each insert across tables and came across this answer. In the Sproc, several temp tables are opened and reference each other. I added the Primary Key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS statements to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the cumulative insert check costs roughly Tn comparisons (where Tn is the n-th triangular number, 1 + 2 + 3 + ... + n), because when you insert the x-th row it is checked against the x - 1 previously inserted rows.
When you add the PK after inserting all the values, the check costs roughly n^2 comparisons, because each row is checked against all n existing rows.
The first is faster, since Tn is roughly half of n^2.
P.S. Example: if you insert 5 rows, that is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations.