Suppose I have a SQL Server table that has millions of rows and receives over 2000 inserts per minute. A separate process needs to do a bulk update on this table, let's say with a where clause that will update 1000 rows. But it doesn't care about performance and could optionally run 1000 single-row updates using the primary key.
If the bulk update runs too long, it will block the incoming insertions, right? Whereas updating rows individually will allow insertions to squeak through the cracks and not block? So from the standpoint of optimizing performance for the insertions, am I better off running the updates one row at a time?
Updates will not block the inserts, but you might get unexpected behavior if the WHERE condition does not apply to the newly inserted rows. So it's better to review the application logic to make sure the newly inserted rows are not needed in the update.
But in general, one bulk update is much better than many single-row updates.
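One way to sidestep that concern is to pin down the affected rows before updating anything, so rows inserted after that point are never swept into the update. A minimal sketch, with hypothetical table and column names (dbo.BigTable, Id, SomeFilter, SomeFlag):
-- Capture the primary keys of the rows that currently qualify
SELECT Id
INTO #ToUpdate
FROM dbo.BigTable
WHERE SomeFilter = 1;

-- Update only those captured rows, joining back on the primary key
UPDATE b
SET b.SomeFlag = 1
FROM dbo.BigTable AS b
JOIN #ToUpdate AS t ON t.Id = b.Id;

DROP TABLE #ToUpdate;
Since the second statement touches roughly 1,000 rows by key, it normally holds row or page locks and finishes quickly, so concurrent inserts are unlikely to be blocked for long.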
I have a long-running stored procedure with a lot of statements. After analyzing it, I identified a few statements that take most of the time; those statements are all UPDATE statements.
Looking at the execution plan, the query scans the source table in parallel in a few seconds and then passes the rows to a Gather Streams operator, which hands them to the update itself.
This is somewhat similar to the article below, and we see the same behavior with index creation statements causing slowness too.
https://brentozar.com/archive/2019/01/why-do-some-indexes-create-faster-than-others/
The table has 60 million records and is a heap, as we do a lot of data loads, updates, and deletes.
Reading the source is not a problem, since it completes in a few seconds, but the actual update, which happens serially, takes most of the time.
A few suggestions to try:
If you have indexes on the target table, dropping them before the load and recreating them afterwards should improve insert performance.
Add a WITH (TABLOCK) hint to the table you are inserting into (INSERT INTO [Table] WITH (TABLOCK) ...); this lets SQL Server lock the table exclusively and allows the insert itself to run in parallel.
Alternatively, if that doesn't yield an improvement, try adding an OPTION (MAXDOP 1) hint to the query. A sketch of both options follows below.
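The two hints look roughly like this; the table and column names (dbo.TargetHeap, dbo.SourceTable, Col1, Col2, KeyCol) are placeholders:
-- Option 1: exclusive table lock on the target, which also lets the insert
-- qualify for minimal logging (recovery model permitting) and run in parallel
INSERT INTO dbo.TargetHeap WITH (TABLOCK) (Col1, Col2)
SELECT Col1, Col2
FROM dbo.SourceTable;

-- Option 2: if parallelism itself is hurting, force a serial plan
UPDATE t
SET t.Col1 = s.Col1
FROM dbo.TargetHeap AS t
JOIN dbo.SourceTable AS s ON s.KeyCol = t.KeyCol
OPTION (MAXDOP 1);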
How often do you UPDATE the rows in this heap?
Because, unlike a clustered index, a heap uses a RID to find specific rows. The catch is that (unless you specifically rebuild the heap) when an update forces a row to move, the old location stays behind as a forwarding pointer to the new location, increasing the number of lookups needed each time you touch that row.
I don't really think that is the main factor here, but could you possibly see how the update times are affected if you add a clustered index to the table?
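If you want to check whether forwarding pointers are piling up before adding a clustered index, something like this will show the count (the table name dbo.BigHeap is a placeholder), and a heap rebuild removes them:
SELECT forwarded_record_count, page_count, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.BigHeap'), 0, NULL, 'DETAILED');

-- Rebuilding the heap (SQL Server 2008+) removes the forwarding pointers
ALTER TABLE dbo.BigHeap REBUILD;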
Also, I assume you don't have some heavy trigger on the table doing a bunch of extra work as well, right?
Additionally, since you are referring to an article by Brent Ozar: he advocates breaking updates into batches of no more than 4,000 rows at a time, as that has proven fastest in his tests and stays below the roughly 5,000-lock threshold at which SQL Server tries to escalate to an exclusive table lock.
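A minimal batching sketch along those lines; the table, column, and predicate (dbo.BigHeap, SomeFlag) are hypothetical:
DECLARE @BatchSize int = 4000;

WHILE 1 = 1
BEGIN
    -- Each iteration updates at most @BatchSize rows in its own (autocommit)
    -- transaction and only touches rows that still need the change
    UPDATE TOP (@BatchSize) dbo.BigHeap
    SET SomeFlag = 1
    WHERE SomeFlag <> 1;

    IF @@ROWCOUNT < @BatchSize BREAK;   -- last partial batch done
END;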
I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on the tables (around 10 tables use the same logic).
The logic is:
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl).
Update all data in MyTbl that matches TMP_Tbl on the customerId column.
Insert all rows from TMP_Tbl1 whose customerId does not yet exist in MyTbl.
Using the above steps, populating TMP_Tbl takes some time. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contain 5M rows, some contain 2K rows
each table's update took up to 2-3 minutes, which adds up to about 15 to 20 minutes in total
these updates run simultaneously in separate sequence containers
Does anyone know the best approach here? It seems like the physical temp table needs to be removed; is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind the DELETE, reinserting the rows should in general outperform an UPDATE.
Considering this, the faster approach will be:
[Execute SQL Task] Delete all records which you need to update (depending on your DB design and queries, an index may help here); see the sketch after this list.
[Data Flow Task] Fast load (using an OLE DB Destination, Data access mode: "Table or view - fast load") both updated and new records from the source into MyTbl. No need for temp tables here.
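A sketch of what the Execute SQL Task could run, assuming the rows to be reloaded can be identified by customerId against the staging data; the staging table name dbo.StagingTbl is a guess:
DELETE m
FROM dbo.MyTbl AS m
WHERE EXISTS (
    SELECT 1
    FROM dbo.StagingTbl AS s
    WHERE s.customerId = m.customerId
);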
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help); 2-3 minutes per update is far too long if only a modest number of rows is actually changing.
If it is 2-3 minutes to update millions of records, though, that's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE seek instead of scanning a big table, it is usually well worth it.
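For illustration, an index on the join column lets the UPDATE seek; the index name and the SomeCol column are placeholders, while customerId, MyTbl, and TMP_Tbl come from the question:
CREATE NONCLUSTERED INDEX IX_MyTbl_customerId
    ON dbo.MyTbl (customerId);

UPDATE m
SET m.SomeCol = t.SomeCol
FROM dbo.MyTbl AS m
JOIN dbo.TMP_Tbl AS t
    ON t.customerId = m.customerId;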
If I have millions of rows to update in SQL Server, how should I proceed? Is there a sensible method for a bulk update, since otherwise it will lock the table for a long time?
You can load the data with a bulk operation (which requires bulkadmin rights); triggers do not fire during a bulk load by default, which makes the load take less time if you have triggers. The downside is exactly that the triggers won't fire, so you may need to fire them explicitly or write code to do what the trigger would have done.
You can run the update in batches (of, say, 10,000 rows) so that you are not updating millions of rows in one transaction. Make sure each batch is in its own transaction. It is often best to do this during off-peak hours as well.
You can make sure your code is written so that it will not update rows that don't need updating. For instance, you might see an update statement like this:
Update table1
set somefield = 0
when what you actually need is:
Update table1
set somefield = 0
where somefield <> 0
There is a huge performance difference between updating a million rows and updating only the 35 that actually needed it.
If your update is from a file, use SSIS, which is optimized for high performance if you write the package correctly.
The interviewer(s) probably wanted to gauge your knowledge of lock granularity. If the "bulk" update would otherwise lock the table, you could update smaller subsets of the table data, each in a separate transaction, until all necessary rows were updated. Smaller subsets allow the updates to hold finer-grained locks, such as page or row (RID) locks, rather than escalating to a table lock.
I have reports that perform some time-consuming data calculations for each user in my database, and the result is 10 to 20 newly calculated records per user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is to delete all previous records for a given user and then insert the new set. There is no PK on the table; only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is that I cannot do this in a batch, because of the way the script works: it examines one user at a time, deletes that user's previous 10-20 records, and inserts a new set of 10-20 records, over and over. I am worried that the transaction log will run out of space or that other performance issues could occur. I would like to configure the table not to worry about data preservation or anything else that could slow it down. I cannot drop the indexes and so on, because people are accessing the table while it is being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes it will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
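For reference, a minimal MERGE sketch that would replace the per-user DELETE/INSERT; the table and column names (dbo.UserSnapshot, dbo.NightlyResults, UserId, MetricId, MetricValue) are invented, and only the idea of matching on the business key comes from the question:
MERGE dbo.UserSnapshot AS target
USING dbo.NightlyResults AS source
    ON  target.UserId   = source.UserId
    AND target.MetricId = source.MetricId
WHEN MATCHED THEN
    UPDATE SET target.MetricValue = source.MetricValue
WHEN NOT MATCHED BY TARGET THEN
    INSERT (UserId, MetricId, MetricValue)
    VALUES (source.UserId, source.MetricId, source.MetricValue);
Rows that no longer appear in the nightly results could be removed with a separate DELETE if needed.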
This is called a bulk insert. You would drop all indexes on the destination table and send the INSERT commands in large batches (hundreds of INSERT statements per batch) separated by semicolons.
Another way is to use the BULK INSERT statement (http://msdn.microsoft.com/en-us/library/ms188365.aspx), but it involves dumping the data to a file first.
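A minimal BULK INSERT sketch; the file path, delimiters, and table name are assumptions:
BULK INSERT dbo.UserSnapshot
FROM 'C:\loads\snapshot.csv'
WITH (
    FIELDTERMINATOR = ',',
    ROWTERMINATOR   = '\n',
    TABLOCK   -- helps the load qualify for minimal logging
);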
See also: Bulk Insert Sql Server millions of record
It really depends upon many things
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap", i.e. an un-indexed table. So dropping all of your indexes before the load and re-creating them afterwards may improve your performance; a sketch follows below.
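Roughly, the pattern looks like this; the index and table names are placeholders:
DROP INDEX IX_Snapshot_UserId ON dbo.UserSnapshot;

-- ... run the bulk load here ...

CREATE NONCLUSTERED INDEX IX_Snapshot_UserId
    ON dbo.UserSnapshot (UserId);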
Partitioning the table may yield performance benefits if you partition by active and inactive users (although the data set may be a little small for this).
Measure how much time each tweak adds to or shaves off your load, and work from there.
I have a very large database, a little over 60 GB, with many tables containing millions of rows. I am getting some timeout errors, so I am rethinking some of my code design.
Currently, my pseudo-code is like this:
delete from table where person=123 (deletes about 200 rows)
Then I re-insert the updated data (again, 200 rows). The data is always different, as it's time sensitive.
If I were to do an update instead of an insert, I'd have to select the rows first (I'm using an ORM in C#).
tl;dr
I am just wondering, simple question, what is more cost effective.
Select / Update or Delete/Insert?
If you update any column that is part of the clustered index key then your update is handled internally as a delete/insert anyway
How would you handle the difference in cardinality with an UPDATE? I.e., person=123 has 200 rows to delete but only 199 to insert; an UPDATE cannot handle that on its own.
Your best approach is to use a MERGE statement and a table-valued parameter with the new values (a sketch follows below). Of course, no ORM can handle this, but you mention 'performance', and the terms 'performance' and 'ORM' cannot be used in the same sentence...
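A rough sketch of that approach; the type, procedure, table, and column names are all made up, and the final clause prunes only the given person's rows to cover the 200-delete/199-insert case raised above:
CREATE TYPE dbo.PersonRowType AS TABLE
(
    RowKey int            NOT NULL PRIMARY KEY,
    Value  decimal(18, 2) NOT NULL
);
GO

CREATE PROCEDURE dbo.UpsertPersonRows
    @PersonId int,
    @Rows     dbo.PersonRowType READONLY
AS
BEGIN
    MERGE dbo.PersonData AS target
    USING @Rows AS source
        ON  target.PersonId = @PersonId
        AND target.RowKey   = source.RowKey
    WHEN MATCHED THEN
        UPDATE SET target.Value = source.Value
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (PersonId, RowKey, Value)
        VALUES (@PersonId, source.RowKey, source.Value)
    WHEN NOT MATCHED BY SOURCE AND target.PersonId = @PersonId THEN
        DELETE;   -- removes this person's rows that are no longer supplied
END;
From C#, the new rows are passed as a single SqlParameter with SqlDbType.Structured, so the whole upsert is one round trip to the server.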
With Delete/Insert, you will be writing to the database twice. One time to delete and one time to insert. You will also be logging both of those transactions separately, unless you are properly wrapping the entire process in a single transaction.
You could test both methods and watch the results in SQL Profiler, but nine times out of ten the UPDATE will be quicker.
Couple of caveats: I'd make sure the person key is indexed so that you are not doing a complete table scan to find the affected records.
Finally, as @Mundu says, you may want to do this using a parameterized query via ADO.NET instead of the ORM.