Data modeling: (Why) Should I always create indexes on the FK columns? - sql-server

Consider the following scenario:
ParentLookupTable:
PK: SomeCode CHAR(10)
Has only 5 rows.
Does not change often.
MyTable:
One million rows are added daily.
Has the column SomeCode, an FK to ParentLookupTable.
No query searches or sorts on SomeCode.
Here is the FK definition:
ALTER TABLE MyTable ADD CONSTRAINT FK_MyTable_ParentLookupTable_SomeCode
    FOREIGN KEY (SomeCode) REFERENCES ParentLookupTable (SomeCode)
Should I create an index IX_MyTable_SomeCode?
The index would cost I/O in this write-intensive workload, and I am not sure how it would become useful.

In general, adding an index on the referenced column in the parent table is meant to speed up the FK check. In this case it might not help at all: ParentLookupTable will fit on a single database page, so it should not matter much whether the check does a table scan or an index seek.
Adding an index on the FK column of the child table could help in the case of cascading deletes (or deletes against the parent generally, since SQL Server must locate the referencing child rows); a sketch follows the link below.
Here is a longer article on the subject
http://sqlperformance.com/2012/11/t-sql-queries/benefits-indexing-foreign-keys
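If you do decide you need it (for example because rows will be deleted from ParentLookupTable), a minimal sketch using the index name from the question:
CREATE NONCLUSTERED INDEX IX_MyTable_SomeCode
    ON MyTable (SomeCode);
Given the write-intensive workload and the absence of queries on SomeCode, skipping it is a defensible choice here.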

SQL Server execution plan is suggesting to create an index containing all the columns in the table

I've got a key table with 2 columns: Key, Id.
In a stored procedure I've written, my code joins the Employee table to the key table on the Key column, then selects the Id - something like this:
SELECT E.EmployeeName, K.Id
FROM Employee E
JOIN KeyTable K ON E.Key = K.Key
The execution plan is suggesting to create the following index:
[schema].[Employee] ([Key]) INCLUDE ([Id])
My question is why? If all the information is in the table to begin with why create an index and duplicate that information?
Just because all of the information is "in the table", that doesn't mean that searching the entire table is going to be the most efficient way of obtaining the results for this query.
Here, the server is saying that if it had a way to quickly locate rows in this table given a Key value, the query could be processed more quickly (its suggestions are not 100% reliable, so you should test before implementing).
This can be true if the table is a heap (no clustered index) or for a clustered table where the clustering key(s) don't match the desired access order for the query.
Also, if you think about it - every (non-clustered) index duplicates information. It's just that usually it's a subset of the information rather than the whole set.
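For illustration, the suggested index could be created as follows; the index name IX_Employee_Key and the dbo schema are assumptions, while the key and included columns come straight from the suggestion above:
CREATE NONCLUSTERED INDEX IX_Employee_Key
    ON dbo.Employee ([Key]) INCLUDE ([Id]);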

designing new table for daily uploads - use unique constraint

I am using SQL Server 2012 and am creating a table that will have 8 columns, with the types below:
datetime
varchar(12)
varchar(6)
varchar(100)
float
float
int
datetime
Once a day (normally) there will be an upload of approx 10,000 rows of data. Going forward it's possible it could be 100,000.
The rows will be unique if I group on the first three columns listed above. I have read that I can put a unique constraint on multiple columns, which will guarantee the rows are unique.
I think I'm correct in saying that a unique constraint by default sets up a non-clustered index. Would a clustered index be better, and can I assume that when the table starts to contain millions of rows this won't cause any issues?
My last question: am I right to say that, with the unique constraint applied, querying the data will be quicker (because of the non-clustered or clustered index) and uploading the data will be slower (which is fine)?
A unique index can be non-clustered.
A primary key is unique and can be clustered.
A clustered index is not unique by default.
A unique clustered index is unique :)
You can get more information from this guide.
So we should separate uniqueness from index keys.
If you need to keep data unique by some columns, create a unique constraint (a unique index). You'll protect your data.
You can also create a primary key (PK) on your columns - they will be unique as well. But there is a difference: all other indexes will use the PK for referencing (when it is the clustered key), so the PK must be as short as possible. So my advice: create an identity column (int or bigint) and put the PK on it, then create a unique index on your unique columns; a sketch follows below.
Querying data may become faster if you query on your unique columns; if you query on other columns, you will need to create other, specific indexes.
So: unique keys are for data consistency, indexes are for queries.
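A minimal sketch of that advice, with hypothetical table and column names standing in for the question's eight columns (the first three form the natural key):
CREATE TABLE dbo.DailyUpload (
    UploadId    BIGINT IDENTITY(1,1) NOT NULL,  -- short surrogate key
    ReadingDate DATETIME     NOT NULL,          -- natural-key column 1
    SourceCode  VARCHAR(12)  NOT NULL,          -- natural-key column 2
    BatchCode   VARCHAR(6)   NOT NULL,          -- natural-key column 3
    Description VARCHAR(100) NULL,
    Value1      FLOAT        NULL,
    Value2      FLOAT        NULL,
    Quantity    INT          NULL,
    LoadedAt    DATETIME     NULL,
    CONSTRAINT PK_DailyUpload PRIMARY KEY CLUSTERED (UploadId),
    CONSTRAINT UQ_DailyUpload_NaturalKey UNIQUE (ReadingDate, SourceCode, BatchCode)
);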
I think I'm correct in saying that the unique constraint by default sets up non-clustered index
TRUE
Would a clustered index be better & assuming when the table starts to contain millions of rows this won't cause any issues?
(1) If you need to make (datetime, varchar(12), varchar(6)) unique, and
(2) if you or your application will access rows using datetime, or datetime + varchar(12), or datetime + varchar(12) + varchar(6) in the WHERE clause ALL the time,
then put the primary key on (datetime, varchar(12), varchar(6)); by default it will enforce uniqueness and create a clustered index on all three columns.
but as you commented above:
the queries will vary to be honest. I imagine most queries will make use of the first datetime column
and since you will deal with huge data and might join this table to other tables, it is better to have a surrogate key (an ever-increasing unique identifier) in the table and, to satisfy your SELECTs, non-clustered indexes (see the sketch below).
Surrogate Key vs Business Key
NON-CLUSTERED INDEX
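Reusing the hypothetical names from the sketch in the previous answer, the non-clustered index to satisfy those SELECTs might look like:
CREATE NONCLUSTERED INDEX IX_DailyUpload_ReadingDate
    ON dbo.DailyUpload (ReadingDate, SourceCode, BatchCode);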

How to update guid ID references when converting to identity IDs

I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable and RelatedTable, and the current table structure is as follows:
MainTable [40 million rows]
IDGuid - uniqueidentifier - PK
-- [data columns]
RelatedTable [400 million rows]
RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]
The clustered index for MainTable is currently the primary key (IDGuid). The clustered index for RelatedTable is currently (MainTableIDGuid, SequenceNumber).
I want my conversion to do several things:
Change MainTable to use an integer ID instead of GUID
Add a MainTableIDInt column to related table that links to Main Table's integer ID
Change the primary key and clustered index of RelatedTable to (MainTableIDInt, SequenceNumber)
Get rid of the guid columns.
I've written a script to do the following:
Add an IDInt int IDENTITY column to MainTable. This does a table rebuild and generates the new identity ID values.
Add a MainTableIDInt int column to RelatedTable.
The next step is to populate the RelatedTable.MainTableIDInt column for each row with its corresponding MainTable.IDInt value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.
I can write a SQL statement that does this update:
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)
or
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid
The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:
Clustered index scans over MainTable and RelatedTable and does a Merge Join on them [estimated number of rows = 400 million]
Sorts [estimated number of rows = 400 million]
Clustered index update over RelatedTable [estimated number of rows = 400 million]
I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about performance of these execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?
First, this will be a headache. Second, I wouldn't change any of the indexes or constraints until I had the data in place. I.e., I would add the identity column but not make it the primary key or the clustered index. Then I'd add the soon-to-be-new foreign key columns to the various tables. Your queries should look like:
Update C
Set NewIntForeignKeyId = P.NewIntPrimaryKey
From ChildTable As C
Join ParentTable As P
On P.PrimaryKey = C.ForeignKey
First, notice that I'm using an inner join. There is no reason to use an outer join for this type of query given that you will eventually enforce referential integrity between the new columns. Second, if you populate the columns first and then rebuild the constraints, it will be faster as you'll be able to leverage the existing indexes. Remember that when you change the clustered index, it rebuilds all of the nonclustered indexes. If the tables are large, that will be a serious hit.
Once you have the data in place, I'd then drop all primary constraints, unique constraints, foreign key constraints and unique indexes. Drop the clustered index/constraint last. I'd then add the clustered indexes to all of the tables and after that was done, recreate the unique constraints, foreign key constraints and indexes. If you do not drop the existing indexes before you recreate the clustered index, it will rebuild the existing indexes twice: once when you drop the clustered index and again when you recreate it.
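A sketch of that ordering for the tables in the question; the constraint and index names are hypothetical, and the new int columns must already be populated and made NOT NULL:
-- 1. Drop dependent constraints first; clustered structures last
ALTER TABLE RelatedTable DROP CONSTRAINT FK_RelatedTable_MainTable; -- hypothetical FK name
ALTER TABLE RelatedTable DROP CONSTRAINT PK_RelatedTable;           -- nonclustered PK on RelatedTableID
DROP INDEX CIX_RelatedTable ON RelatedTable;                        -- clustered index (MainTableIDGuid, SequenceNumber)
ALTER TABLE MainTable DROP CONSTRAINT PK_MainTable;                 -- clustered PK on IDGuid
-- 2. Recreate the clustered keys on the int columns
ALTER TABLE MainTable ADD CONSTRAINT PK_MainTable
    PRIMARY KEY CLUSTERED (IDInt);
ALTER TABLE RelatedTable ADD CONSTRAINT PK_RelatedTable
    PRIMARY KEY CLUSTERED (MainTableIDInt, SequenceNumber);
-- 3. Recreate the foreign key last
ALTER TABLE RelatedTable ADD CONSTRAINT FK_RelatedTable_MainTable
    FOREIGN KEY (MainTableIDInt) REFERENCES MainTable (IDInt);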
Btw, I highly doubt there is a way to avoid table scans for this sort of thing since you are going to be updating every row.
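One mitigation the answer doesn't mention, offered only as a sketch: run the population step in batches so that each transaction (and its log usage) stays bounded. The batch size here is arbitrary:
WHILE 1 = 1
BEGIN
    UPDATE TOP (100000) R
    SET MainTableIDInt = M.IDInt
    FROM RelatedTable AS R
    JOIN MainTable AS M ON M.IDGuid = R.MainTableIDGuid
    WHERE R.MainTableIDInt IS NULL;  -- only rows not yet populated

    IF @@ROWCOUNT = 0 BREAK;         -- done when nothing was updated
END;
In practice you would drive the batches off the existing clustered key range rather than the IS NULL filter, so each batch doesn't rescan rows that are already done.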

Creating a Primary Key on a temp table - When?

I have a stored procedure that is working with a large amount of data. I have that data being inserted into a temp table. The overall flow of events is something like:
CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
    ,Col2 INT NOT NULL
    ,Col3 BIGINT
    ,Col4 VARCHAR(25) NOT NULL
    --Etc...
    --
    --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because otherwise the index needs to be reorganized as rows are inserted. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types I used are the real ones. In the #TempTable table, Col1 and Col4 will make up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands as which is faster assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be rewritten, since the clustered index isn't really a separate index - it is the logical order of the data. The execution plan for the inserts depends on the indexes in place when the plan is determined, and if the clustered index is in place, the data will be sorted prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation: presumably there is a best index for reading the data from it in the next step, and that index will need to either be in place or created. This is where you have to trade off speed now for reliability (apply the PK and any other constraints first) against speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution, because SELECT ... INTO is a bulk operation and bulk operations are minimally logged.
E.g.:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED keyword is optional; primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
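A rough harness for such a measurement, offered as a sketch: a single int key column stands in for the real load, and SET STATISTICS TIME reports the elapsed time of each step:
-- Variant A: PK in place before the load
CREATE TABLE #A (Col1 INT NOT NULL PRIMARY KEY);
-- Variant B: load into a heap, add the PK afterwards
CREATE TABLE #B (Col1 INT NOT NULL);

SET STATISTICS TIME ON;

INSERT INTO #A (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;

INSERT INTO #B (Col1)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b;
ALTER TABLE #B ADD PRIMARY KEY (Col1);

SET STATISTICS TIME OFF;
DROP TABLE #A;
DROP TABLE #B;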
I was wondering if I could improve a very, very "expensive" stored procedure entailing a bunch of checks at each insert across tables, and I came across this answer. In the sproc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statements (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down severely. I highly recommend using primary keys - always at least try it out, even when you think you don't need one.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident in my knowledge of this. But since it doesn't look like you are getting much response...
My understanding is that a PK is a unique index, and when you insert each record, the index is updated and optimized. So... if you add the data first and then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the insert checks total Tn comparisons (where Tn is the n-th triangular number, 1 + 2 + 3 + ... + n = n(n+1)/2), because the x-th inserted row is checked against the x - 1 rows inserted before it.
When you add the PK after inserting all the values, the check costs n^2 comparisons, because each of the n rows is checked against all n existing rows.
The first is faster, since n(n+1)/2 is roughly half of n^2.
P.S. Example: if you insert 5 rows, that is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations.

Introducing FOREIGN KEY constraint 'c_name' on table 't_name' may cause cycles or multiple cascade paths

I have a database table called Lesson:
columns: [LessonID, LessonNumber, Description] ...plus some other columns
I have another table called Lesson_ScoreBasedSelection:
columns: [LessonID,NextLessonID_1,NextLessonID_2,NextLessonID_3]
When a lesson is completed, its LessonID is looked up in the Lesson_ScoreBasedSelection table to get the three possible next lessons, each of which are associated with a particular range of scores. If the score was 0-33, the LessonID stored in NextLessonID_1 would be used. If the score was 34-66, the LessonID stored in NextLessonID_2 would be used, and so on.
I want to constrain all the columns in the Lesson_ScoreBasedSelection table with foreign keys referencing the LessonID column in the lesson table, since every value in the Lesson_ScoreBasedSelection table must have an entry in the LessonID column of the Lesson table. I also want cascade updates turned on, so that if a LessonID changes in the Lesson table, all references to it in the Lesson_ScoreBasedSelection table get updated.
This particular cascade update seems like a very straightforward, one-way update, but when I try to apply a foreign key constraint to each field in the Lesson_ScoreBasedSelection table referencing the LessonID field in the Lesson table, I get the error:
Introducing FOREIGN KEY constraint 'c_name' on table 'Lesson_ScoreBasedSelection' may cause cycles or multiple cascade paths.
Can anyone explain why I'm getting this error or how I can achieve the constraints and cascading updating I described?
You can't have more than one cascading RI link to a single table from any given referencing table. Microsoft explains this:
You receive this error message because in SQL Server, a table cannot appear more than one time in a list of all the cascading referential actions that are started by either a DELETE or an UPDATE statement. For example, the tree of cascading referential actions must only have one path to a particular table on the cascading referential actions tree.
Given this SQL Server constraint, why not solve the problem by creating a table with SelectionID (PK), LessonID, Next_LessonID, and QualifyingScore as the columns? Use a constraint to ensure LessonID and QualifyingScore are unique together.
In the QualifyingScore column I'd use a tinyint and make it 0, 1, or 2. Either that, or you could use QualifyingMinScore and QualifyingMaxScore columns, so you could say:
SELECT * FROM NextLesson
WHERE LessonID = @MyLesson
AND QualifyingMinScore <= @MyScore
AND @MyScore <= QualifyingMaxScore
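Putting that together, a minimal sketch; the int key types and constraint names are assumptions, and the NextLesson name matches the query above:
CREATE TABLE NextLesson (
    SelectionID        INT IDENTITY(1,1)
        CONSTRAINT PK_NextLesson PRIMARY KEY,
    LessonID           INT NOT NULL
        CONSTRAINT FK_NextLesson_Lesson
        REFERENCES Lesson (LessonID) ON UPDATE CASCADE,
    Next_LessonID      INT NOT NULL
        CONSTRAINT FK_NextLesson_NextLessonID
        REFERENCES Lesson (LessonID),  -- making this one cascade too would recreate the error
    QualifyingMinScore TINYINT NOT NULL,
    QualifyingMaxScore TINYINT NOT NULL,
    CONSTRAINT UQ_NextLesson_LessonScore UNIQUE (LessonID, QualifyingMinScore)
);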
Cheers,
Eric
