Performance by using UPDATE / UPDATE TOP (1) - sql-server

I'm normally using:
UPDATE table1 SET field1='test' WHERE ID=10
But would it be more efficient to use the following statement:
UPDATE TOP (1) table1 SET field1='test' WHERE ID=10
if I have a lot of records?
The ID Column is a primary key and autoincremented too.

If the ID column is a Primary Key, then there will be at most a single record affected by your UPDATE query.
If your Primary Key is by default a Clustered Index, then the performance should be similar in both cases.
Even if you specify the PK as non-clustered when creating it, you still get a performance boost when searching, selecting, identifying or filtering records (because you're using WHERE). This might not be as fast as a clustered index PK, because the engine has to follow the non-clustered index entry to the actual row, but the performance difference should be negligible.
When creating a PK, you're forced to pick one of the two indexing types for your key, as mentioned and explained here in more detail.
Hence, both versions of the UPDATE query should have similar performance (possibly small differences when running on different occasions due to other ancillary operations).
In conclusion:
If you have a Primary Key on your ID column, and you're using it in the FILTERING part of the query (WHERE), then you should be fine when you're querying thousands, millions and possibly even up to billions of records.
Disclaimer:
The performance / speed of the UPDATE query also depends on what other indexes need to be updated, due to the changing values (indexes which contain the field1 as their key), triggers on your table, cascading rules for foreign keys etc.
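One way to check this on your own data is to run both statements with I/O statistics enabled and compare the logical reads and the execution plans. The sketch below assumes a table shaped like the one in the question; the dbo schema, the constraint name PK_table1 and the column types are assumptions made for illustration only.

```sql
-- Hypothetical table matching the question: ID is an identity primary key.
CREATE TABLE dbo.table1 (
    ID     int IDENTITY(1,1) NOT NULL
           CONSTRAINT PK_table1 PRIMARY KEY CLUSTERED,
    field1 varchar(50) NULL
);
GO

-- Report logical reads for each statement in the Messages output.
SET STATISTICS IO ON;

-- Plain update: the WHERE on the primary key can match at most one row.
UPDATE dbo.table1 SET field1 = 'test' WHERE ID = 10;

-- TOP (1) variant: same lookup by primary key, plus a limit that is already implied.
UPDATE TOP (1) dbo.table1 SET field1 = 'test' WHERE ID = 10;

SET STATISTICS IO OFF;
GO
```

If both statements report the same logical reads and both plans show a seek on the primary key, TOP (1) is adding nothing in this case.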

Related

SQL - define keys to table

Are there any considerations when defining keys for a table that already has a lot of records, and where most of the operations performed on it are inserts?
Key definition ultimately comes down to how you can uniquely and efficiently identify any specific row in a table. If a business key value fulfills that requirement, then it is a suitable candidate. An ideal key is also skinny. A GUID is horrible for this (IMHO) because it is far larger than it needs to be.
If insert performance is the most important priority and a suitable business key is not available, you can use an integer based identity key. If you expect more than 2.1 billion records within a few years, use bigint (9 quintillion records) instead.
Keep in mind that every non-clustered index on the table includes the clustered index key, which is the PK by default. Having a skinny PK can make your indexes more efficient, using less storage, memory and CPU.
Insert speed is affected by the clustered index sort order as well as the number and sort order of all non-clustered indexes on the table. Column-store indexes are not sorted and have minimal overhead on inserts.
A PK that stores an ID number is heavier than an auto-increment number, so when you define the key, keep in mind that it is better to create a separate auto-increment column to serve as the PK.
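As a rough sketch of the advice above, a table can keep a skinny identity column as its primary key and still enforce uniqueness of the wider business value with a separate constraint. The table dbo.Persons and all of its columns are hypothetical names chosen for the example.

```sql
-- Hypothetical table: skinny surrogate key plus a unique business key.
CREATE TABLE dbo.Persons (
    -- bigint leaves room for far more than 2.1 billion rows.
    PersonId   bigint IDENTITY(1,1) NOT NULL,
    -- The wider "ID number" stays a regular column.
    NationalId varchar(20)   NOT NULL,
    FullName   nvarchar(200) NOT NULL,
    CONSTRAINT PK_Persons PRIMARY KEY CLUSTERED (PersonId),
    -- Uniqueness of the business value is still enforced.
    CONSTRAINT UQ_Persons_NationalId UNIQUE (NationalId)
);
```

The trade-off is one extra index to maintain on insert, in exchange for a narrow key that is carried into every other index and foreign key.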

Proper table design for sparse primary key

In my system I have temporary entities that are created based on rules stored in my database, and the entities are not persisted.
Now, what I need is to store information about these entities, and because they are created based on rules and are not stored, they have no ID.
I came up with a formula to generate an ID for these temp entities based on the rule that was used to generate them: id = rule id + "-" + entity index in the rule. This formula generates unique strings of the form 164-3, 123-0, 432-2, etc...
My question is how should I build my table (regarding primary key and clustered index) when my keys have no relation or order? Keep in mind that I will only (99.9% of the time) query the table using the id mentioned above.
Options I thought about after much reading, but don't have the knowledge to determine which is better:
1) primary key on a varchar column with clustered index. -According to various sources, this would be bad because of fragmentation and the width of the key. Also, the format is pretty weird for sorting.
2) primary key on varchar column without clustered index (heap table). -Also a bad idea according to various sources due to indexing and fragmentation issues.
3) identity int column with clustered index, and a varchar column as primary key with unique index. -Can't really see the benefit of the surrogate key here, since it would mainly help with range queries and ordering, and I would never query the table based on this key because it would be unknown at all times.
4) 2 columns composite key: rule id + rule index columns. -Now I don't have strings, but I have two columns that will be copied to FKs and non-clustered indexes. Also, I'm not sure what indexes I would use in this case.
Can anybody shine a light here? Any help is appreciated.
--Edit
I will perform more selects than inserts;
I will perform more inserts than updates;
All selects will include at least rule id;
If I use a surrogate primary key and a unique index on (rule id, index), then I can use the surrogate for subsequent operations after retrieving data by rule id, which would be faster. Also, inserts would be faster.
However, because the data will be stored according to the surrogate key, I might have records that have the same rule id but a different index stored quite far from each other on disk, which means that even with an index on rule id, retrieving the data could be kind of slow.
If I use (rule id, index) as clustered primary key, rows with same rule id would be stored close to each other, and selecting data by rule id would be efficient enough. However, I suspect inserts would be slow.
Is the rationale above correct?
Using a heap is generally a bad idea unless proven otherwise. Even so, you will need a very solid reason for not having a clustered index (any one will make things better, even on identity column).
Storing this key in a single column is okay; if you want natural sorting, you can pad your numbers with zeroes, for example. However, this will widen the key.
Having a composite primary key (and, subsequently, composite foreign keys) is completely acceptable, especially when dealing with natural keys like the one you have. This will give you the narrowest possible key (int + int or similar) while eliminating the sorting issue at the same time. I would recommend making this PK clustered to avoid additional key lookups.
Fragmentation here will not be a big issue; at least, no bigger than with any other indexing decision. Any index built on such a key will be prone to fragmentation, clustered or not. In any case, your DBA should know how to keep an index such as this in top form.
Regarding the order of columns in the index, the following rules usually apply:
1) If partial key matches will take place (filtering by one part of the key but not by the other), the part that is used most often should go first;
2) If No. 1 isn't applicable and all parts of the key are used in all queries, the column with the highest cardinality should go first.
The order of the remaining columns (if there is more than one) isn't of much importance, because SQL Server only builds the statistics histogram on the first column of a composite index. However, it is a good idea to list them in order of decreasing cardinality.
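To illustrate the first rule, here is a small sketch with a hypothetical table dbo.OrderLines (the table, its columns and the constraint name are all invented for the example). The comments describe what the optimizer can typically do with each predicate; the actual plan depends on your data and statistics.

```sql
-- Hypothetical table with a two-column clustered key.
CREATE TABLE dbo.OrderLines (
    OrderId    int NOT NULL,
    LineNumber int NOT NULL,
    Quantity   int NOT NULL,
    CONSTRAINT PK_OrderLines PRIMARY KEY CLUSTERED (OrderId, LineNumber)
);
GO

-- Filters on the leading column: the index can be used for a seek.
SELECT OrderId, LineNumber, Quantity
FROM dbo.OrderLines
WHERE OrderId = 42;

-- Filters on both columns: a seek on the full key.
SELECT OrderId, LineNumber, Quantity
FROM dbo.OrderLines
WHERE OrderId = 42 AND LineNumber = 3;

-- Filters only on the second column: the key order does not help,
-- so this will generally end up scanning the index.
SELECT OrderId, LineNumber, Quantity
FROM dbo.OrderLines
WHERE LineNumber = 3;
```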
EDIT: Seeing your update with additional details, here are the most suitable options. Suppose your table looks like this:
-- Sample table
create table dbo.TempEntities (
    RuleId int not null,
    IndexId int not null,
    -- Remaining columns listed here
    EntityData xml not null
);
go
From here, the most straightforward way is to use the natural key as a clustered index:
-- Option 1 - natural clustered index
alter table dbo.TempEntities
    add constraint PK_TempEntities primary key clustered (RuleId, IndexId);
go
However, if you have any child tables that would reference this one, it might not be the most convenient solution, because natural keys are prone to updates, which creates a mess where you could avoid it. Instead, a surrogate key can be introduced, like this:
-- Option 2 - surrogate clustered, natural nonclustered
alter table dbo.TempEntities add Id bigint identity(1,1) not null;
alter table dbo.TempEntities
    add constraint PK_TempEntities primary key clustered (Id);
alter table dbo.TempEntities
    add constraint UQ_TempEntities_RuleIdIndexId unique (RuleId, IndexId);
go
It makes sense to have the surrogate PK clustered, because it will result in far fewer page splits, making inserts faster (despite having one more index than Option 1). Without any intimate knowledge of your queries, this is probably the most balanced solution.
Shuffling the clustered attribute between surrogate and natural keys has mostly academic value and can only make a difference on a high-load system with hundreds of inserts happening every second on a 24/7 schedule. If your system is indeed like that, please seek a professional consultant who will analyse your queries and provide a solution tailored to your situation.

Oracle DB, creating unique constraint on multiple columns for insert , how about performance

Friends,
I am new to databases and need some help/information.
There is a table in our project, say "record_table", and values are inserted into it by C++ code.
This table has multiple columns, and for three of them, say "serialNo, type, sub_type", the C++ code is inserting duplicate combinations of values (none of these columns is unique or part of a primary key for that table). But the combination of the 3 columns should be unique.
Now we want to make sure duplicates of this combination can't be inserted. I was thinking of adding a unique constraint on these columns, so that a new record with duplicated values is rejected.
I assume this should work, but I have a doubt: will it hurt performance? The C++ binary runs daily and inserts around 2 million records. Will creating a unique constraint slow down the run time? Or, since the table has millions of records, does creating a unique constraint make no sense because it has to make a hash of these columns, etc.?
Please advise if you can.
Unique constraints are enforced through an index. Chances are you need that index anyway, for querying the data back again, so the overhead of maintaining it is irrelevant.
The real question is, what is the performance impact of handling duplicate records if you don't enforce the constraint? Generally speaking the performance impact of enforcing constraints is trivial compared to fixing data corruption.
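A minimal sketch of what this could look like in Oracle, using the table and column names from the question as placeholders (the constraint name is made up). Existing duplicates have to be found and cleaned up first, because the constraint cannot be validated while they are still in the table.

```sql
-- Step 1: list combinations that already occur more than once.
SELECT serialNo, type, sub_type, COUNT(*) AS copies
FROM record_table
GROUP BY serialNo, type, sub_type
HAVING COUNT(*) > 1;

-- Step 2: once the duplicates are resolved, add the constraint.
-- Oracle enforces it through a unique index on the three columns.
ALTER TABLE record_table
    ADD CONSTRAINT uq_record_serial_type_subtype
    UNIQUE (serialNo, type, sub_type);
```

After this, an insert that repeats an existing combination fails with ORA-00001 (unique constraint violated), which the C++ code would need to handle.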

Using a meaningless ID as my clustered index rather than my primary key

I'm working in SQL Server 2008 R2
As part of a complete schema rebuild, I am creating a table that will be used to store advertising campaign performance by zipcode by day. The table setup I'm thinking of is something like this:
CREATE TABLE [dbo].[Zip_Perf_by_Day] (
    [CampaignID] int NOT NULL,
    [ZipCode] int NOT NULL,
    [ReportDate] date NOT NULL,
    [PerformanceMetric1] int NOT NULL,
    [PerformanceMetric2] int NOT NULL,
    [PerformanceMetric3] int NOT NULL,
    and so on... )
Now the combination of CampaignID, ZipCode, and ReportDate is a perfect natural key, they uniquely identify a single entity, and there shouldn't be 2 records for the same combination of values. Also, almost all of my queries to this table are going to be filtered on 1 or more of these 3 columns. However, when thinking about my clustered index for this table, I run into a problem. These 3 columns do not increment over time. ReportDate is OK, but CampaignID and Zipcode are going to be all over the place while inserting rows. I can't even order them ahead of time because results come in from different sources during the day, so data for CampaignID 50000 might be inserted at 10am, and CampaignID 30000 might come in at 2pm. If I use the PK as my clustered index, I'm going to run into fragmentation problems.
So I was thinking that I need an Identity ID column, let's call it PerformanceID. I can see no case where I would ever use PerformanceID in either the select list or where clause of any query. Should I use PerformanceID as my PK and clustered index, and then set up a unique constraint and non-clustered indexes on CampaignID, ZipCode, and ReportDate? Should I keep those 3 columns as my PK and just have my clustered index on PerformanceID? (<- This is the option I'm leaning towards right now) Is it OK to have a slightly fragmented table? Is there another option I haven't considered? I am looking for what would give me the best read performance, while not completely destroying write performance.
Some actual usage information. This table will get written to in batches. Feeds come in at various times during the day, they get processed, and this table gets written to. It's going to get heavily read, as by-day performance is important around here. When I fill this table, it should have about 5 million rows, and will grow at a pace of about 8,000 - 10,000 rows per day.
In my experience, you probably do want to use another INT Identity field as your clustered index key. I would also add a UNIQUE constraint to that one (it helps with execution plans).
A big part of the reason is space - if you use a 3 field key for your clustered index, you will have all 3 fields in every row of every non-clustered index on that table (as your clustered index row identifier). If you only plan to have a couple of indexes that isn't a big deal, but if you have a lot of them it can make a big difference. The more data per row, the more pages needed and the more IO you have.
Fragmentation is a very real issue that can cause major performance problems, especially as the table grows.
Having that additional cluster key will also mean writes will be faster for your inserts. All new rows will go to the end of your table, which means existing rows won't be touched or rearranged.
If you want to use those three fields as a FK in other tables, then by all means have them as your PK.
For the most part it doesn't really matter if you ever directly reference your clustered index key. As long as it is narrow, increasing, and unique you should be in good shape.
EDIT:
As Damien points out in the comments, if you will be filtering on single fields of your PK, you will need to have an index on each one (or always use the first field in the covering index).
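A sketch of the layout described in this answer, applied to the table from the question: the natural key stays the primary key but is declared non-clustered, and the clustered index goes on a new identity column. PerformanceID comes from the question; the constraint and index names are assumptions, and only one metric column is shown.

```sql
CREATE TABLE dbo.Zip_Perf_by_Day (
    PerformanceID      int IDENTITY(1,1) NOT NULL,
    CampaignID         int  NOT NULL,
    ZipCode            int  NOT NULL,
    ReportDate         date NOT NULL,
    PerformanceMetric1 int  NOT NULL,
    -- Natural key remains the PK, enforced by a non-clustered index.
    CONSTRAINT PK_Zip_Perf_by_Day
        PRIMARY KEY NONCLUSTERED (CampaignID, ZipCode, ReportDate)
);
GO

-- Narrow, increasing, unique clustering key: new rows go to the end.
CREATE UNIQUE CLUSTERED INDEX CIX_Zip_Perf_by_Day_PerformanceID
    ON dbo.Zip_Perf_by_Day (PerformanceID);
GO
```

Queries that filter only on ZipCode or ReportDate would still need their own indexes, since the PK index leads with CampaignID.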
On the information given, (ReportDate, CampaignID, ZipCode) or (ReportDate, ZipCode, CampaignID) seem like better candidates for the clustered index than a surrogate key. Defragmentation would be a potential concern if the time taken to rebuild indexes became prohibitive, but given the sizes I would expect for this table (10s or 1000s rather than 1,000,000s of rows per day) that seems unlikely to be an issue.
If I understood all you have written correctly you are opting out of natural clustering due to fragmentation penalties.
For this purpose you consider meaningless IDs which will:
avoid insert penalties for clustered index when inserting out of order batches (great for write performance)
guarantee that your data is fragmented for reads that put conditions on the natural key (not so good for read performance)
JNK points out that fragmentation can be a real issue; however, you need to establish a baseline against which you will measure, and you need to establish whether reading or writing is more important to you (or how important each is in measurable terms).
There's nothing that will beat a good test case - so finally that is the best recommendation I can give.
With databases it is often relatively easy to build scripts that will create real benchmarks with real workloads and realistic data quantities.

Should I get rid of clustered indexes on Guid columns

I am working on a database that usually uses GUIDs as primary keys.
By default SQL Server places a clustered index on primary key columns. I understand that this is a silly idea for GUID columns, and that non-clustered indexes are better.
What do you think - should I get rid of all the clustered indexes and replace them with non-clustered indexes?
Why wouldn't SQL's performance tuner offer this as a recommendation?
A big reason for a clustered index is when you often want to retrieve rows for a range of values for a given column. Because the data is physically arranged in that order, the rows can be extracted very efficiently.
Something like a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
As to why it's not offered as a recommendation, I'd suggest the tuner is aware of this fact.
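For reference, switching an existing GUID primary key from clustered to non-clustered roughly takes the form below. This is only a sketch: dbo.Customers, CustomerGuid, CreatedAt and the constraint and index names are hypothetical, and any foreign keys that reference the primary key would have to be dropped before it and recreated afterwards.

```sql
-- Drop the existing clustered primary key (the table becomes a heap).
ALTER TABLE dbo.Customers DROP CONSTRAINT PK_Customers;
GO

-- Recreate the primary key on the GUID column as non-clustered.
ALTER TABLE dbo.Customers
    ADD CONSTRAINT PK_Customers PRIMARY KEY NONCLUSTERED (CustomerGuid);
GO

-- Optionally cluster on a column that suits range queries better.
CREATE CLUSTERED INDEX CIX_Customers_CreatedAt
    ON dbo.Customers (CreatedAt);
GO
```

On a large table each of these steps rewrites data or rebuilds indexes, so it is worth trying on a copy first.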
You almost certainly want to establish a clustered index on every table in your database.
If a table does not have a clustered index it is what is referred to as a "Heap", and most types of common queries perform worse on a heap than on a table with a clustered index.
Which fields the clustered index should be established on depends on the table itself and on the expected usage patterns of queries against the table. In almost every case you probably want the clustered index to be on a column or a combination of columns that is unique (i.e., an alternate key), because if it isn't, SQL Server will add a hidden unique value (a uniquifier) to the end of whatever fields you select anyway. If your table has a column or columns that will be frequently used by queries to select or filter multiple records, these columns are candidates for the clustered index. For example, your table might contain sales transactions that your application frequently requests by product ID; or, even better, it might be an invoice details table, where in almost every case you retrieve all the detail records for a specific invoice; or an invoice table where you often retrieve all the invoices for a particular customer. This is true whether you will be selecting large numbers of records by a single value or by a range of values.
The order of the columns in the clustered index is critical. The first column defined in the index should be the column that will be selected or filtered on first in expected queries.
The reason for all this is based on understanding the internal structure of a database index. These indices are called balanced-tree (B-Tree) indices. They are kind of like a binary tree, except that each node in the tree can have an arbitrary number of entries (and child nodes) instead of just two. What makes a clustered index different is that the leaf nodes in a clustered index are the actual data pages of the table itself, whereas the leaf nodes of a non-clustered index just "point" to the table's data pages.
When a table has a clustered index, therefore, the table's data pages are the leaf level of that index, and each one has a pointer to the previous page and the next page in the index order (they form a doubly-linked list).
So if your query requests a range of rows that is in the same order as the clustered index... the processor only has to traverse the index once (or maybe twice), to find the start page of the data, and then follow the linked list pointers to get to the next page and the next page, until it has read all the data pages it needs.
For a non-clustered index, it has to traverse the index once for every row it retrieves...
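The difference can be sketched with the invoice details example mentioned above. The table dbo.InvoiceDetails, its columns and the index names are hypothetical; the comments describe the access pattern one would typically expect, not a guaranteed plan.

```sql
-- Hypothetical invoice details table, clustered by invoice.
CREATE TABLE dbo.InvoiceDetails (
    InvoiceId  int NOT NULL,
    LineNumber int NOT NULL,
    ProductId  int NOT NULL,
    Quantity   int NOT NULL,
    CONSTRAINT PK_InvoiceDetails
        PRIMARY KEY CLUSTERED (InvoiceId, LineNumber)
);
CREATE NONCLUSTERED INDEX IX_InvoiceDetails_ProductId
    ON dbo.InvoiceDetails (ProductId);
GO

-- Range in clustered order: one seek to the first row, then the
-- engine reads neighbouring data pages via the leaf-level links.
SELECT InvoiceId, LineNumber, ProductId, Quantity
FROM dbo.InvoiceDetails
WHERE InvoiceId BETWEEN 1000 AND 1010;

-- Via the non-clustered index: Quantity is not in that index, so each
-- matching row needs an extra lookup into the clustered index
-- (or the optimizer may decide a scan is cheaper).
SELECT InvoiceId, LineNumber, ProductId, Quantity
FROM dbo.InvoiceDetails
WHERE ProductId = 7;
```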
NOTE: EDIT
To address the sequential issue for GUID key columns, be aware that SQL Server 2005 has NEWSEQUENTIALID(), which does in fact generate GUIDs the "old" sequential way.
Or you can investigate Jimmy Nilsson's COMB GUID algorithm, which is implemented in client-side code:
COMB Guids
The problem with clustered indexes on a GUID field is that the GUIDs are random, so when a new record is inserted, a significant portion of the data on disk has to be moved to insert the record into the middle of the table.
However, with integer-based clustered indexes, the integers are normally sequential (like with an IDENTITY spec), so they just get added to the end and no data needs to be moved around.
On the other hand, clustered indexes are not always bad on GUIDs... it all depends upon the needs of your application. If you need to be able to SELECT records quickly, then use a clustered index... the INSERT speed will suffer, but the SELECT speed will be improved.
While clustering on a GUID is normally a bad idea, be aware that GUIDs can under some circumstances cause fragmentation even in non-clustered indexes.
Note that if you're using SQL Server 2005, the newsequentialid() function produces sequential GUIDs. This helps to prevent the fragmentation problem.
I suggest using a SQL query like the following to measure fragmentation before making any decisions (excuse the non-ANSI syntax):
SELECT OBJECT_NAME (ips.[object_id]) AS 'Object Name',
       si.name AS 'Index Name',
       ROUND (ips.avg_fragmentation_in_percent, 2) AS 'Fragmentation',
       ips.page_count AS 'Pages',
       ROUND (ips.avg_page_space_used_in_percent, 2) AS 'Page Density'
FROM sys.dm_db_index_physical_stats
         (DB_ID ('MyDatabase'), NULL, NULL, NULL, 'DETAILED') ips
CROSS APPLY sys.indexes si
WHERE si.object_id = ips.object_id
  AND si.index_id = ips.index_id
  AND ips.index_level = 0;
If you are using NewId(), you could switch to NewSequentialId(). That should help the insert perf.
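A short sketch of that switch, assuming a hypothetical dbo.Documents table (all names here are made up). Note that NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column; it cannot be called directly in a SELECT or an INSERT value list.

```sql
-- Hypothetical table whose GUID key is generated sequentially.
CREATE TABLE dbo.Documents (
    DocumentId uniqueidentifier NOT NULL
        CONSTRAINT DF_Documents_DocumentId DEFAULT NEWSEQUENTIALID(),
    Title      nvarchar(200) NOT NULL,
    CONSTRAINT PK_Documents PRIMARY KEY CLUSTERED (DocumentId)
);
GO

-- Omit DocumentId on insert so the default generates it.
INSERT INTO dbo.Documents (Title) VALUES (N'First document');
INSERT INTO dbo.Documents (Title) VALUES (N'Second document');
GO
```

This only helps when the database generates the key; if the application creates GUIDs on the client, the values stay random unless it uses something like the COMB approach.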
Yes, there's no point in having a clustered index on a random value.
You probably do want clustered indexes SOMEWHERE in your database. For example, if you have a "Author" table and a "Book" table with a foreign key to "Author", and if you have a query in your application that says, "select ... from Book where AuthorId = ..", then you would be reading a set of books. It will be faster if those book are physically next to each other on the disk, so that the disk head doesn't have to bounce around from sector to sector gathering all the books of that author.
So, you need to think about your application, the ways in which it queries the database.
Make the changes.
And then test, because you never know...
As most have mentioned, avoid using a random identifier in a clustered index; you will not gain the benefits of clustering. Actually, you will experience an increased delay. Getting rid of all of them is solid advice. Also keep in mind that newsequentialid() can be extremely problematic in a multi-master replication scenario. If databases A and B both invoke newsequentialid() prior to replication, you will have a conflict.
Yes you should remove the clustered index on GUID primary keys for the reasons Galwegian states above. We have done this on our applications.
It depends if you're doing a lot of inserts, or if you need very quick lookup by PK.
