If I have millions of rows to update in SQL Server, how should I proceed? Is there a sensible method for doing a bulk update, since otherwise it will lock the table for a long time?
You can run the code as a bulkadmin user, which means triggers won't fire; this makes the operation take less time if the table has triggers. The downside is that, since the triggers don't fire, you may need to fire them separately or write code to do what the triggers would have done.
You can run the update in batches (of, say, 10,000 rows) so that you are not updating millions of rows in one transaction. Make sure each batch runs in its own transaction. It is often best to do this during non-peak hours as well.
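A rough sketch of that batching pattern in T-SQL (the table and column names are placeholders, and 10,000 is just a starting batch size to tune):

-- Repeatedly change at most 10,000 qualifying rows per transaction
-- until no rows remain to be updated.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    BEGIN TRANSACTION;

    UPDATE TOP (10000) dbo.table1     -- placeholder table name
    SET somefield = 0
    WHERE somefield <> 0;             -- only rows that still need the change

    SET @rows = @@ROWCOUNT;           -- 0 when nothing is left to update

    COMMIT TRANSACTION;
END;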
You can make sure your code is written so that it will not update rows that don't need updating. For instance, you might see an update statement like this:
Update table1
set somefield = 0
when what you may actually need is
Update table1
set somefield = 0
where somefield <> 0
There is a huge performance difference between updating a million rows and updating only the 35 rows that actually needed the change.
If your update comes from a file, use SSIS, which is optimized for high performance if you write the package correctly.
The interviewer(s) probably wanted to gauge your knowledge of lock granularity. If the "bulk" update would otherwise lock the table, then you could update smaller subsets of the table data, with each one in a separate transaction until all necessary rows were updated. Smaller subsets of data could allow the updates to hold lesser locks, such as extent, page, RID locks, etc.
I have a long-running stored procedure with a lot of statements. After analyzing it, I identified a few statements that take most of the time; they are all UPDATE statements.
Looking at the execution plan, the query scans the source table in parallel in a few seconds and then passes the rows to a Gather Streams operator, which feeds them to the update operator.
This is somewhat similar to the article below, and we see the same behavior with index creation statements causing slowness too.
https://brentozar.com/archive/2019/01/why-do-some-indexes-create-faster-than-others/
The table has 60 million records and is a heap, as we do a lot of data loads, updates, and deletes.
Reading the source is not a problem, as it completes in a few seconds, but the actual update, which happens serially, takes most of the time.
A few suggestions to try:
If you have indexes on the target table, dropping them before the load and recreating them afterwards should improve insert performance.
Add a WITH (TABLOCK) hint to the table you are inserting into; this lets SQL Server lock the table exclusively and allows the insert itself to run in parallel (see the sketch after these suggestions).
Alternatively, if that doesn't yield an improvement, try adding an OPTION (MAXDOP 1) hint to the query.
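As a rough illustration of the TABLOCK suggestion, the insert might look like the sketch below (the table and column names are hypothetical):

-- An exclusive table lock on the target can enable minimal logging
-- and, on recent versions of SQL Server, a parallel insert.
INSERT INTO dbo.TargetTable WITH (TABLOCK) (Id, Payload)
SELECT Id, Payload
FROM dbo.SourceTable;

-- If parallelism itself turns out to be the problem, the opposite
-- experiment is to force a serial plan instead:
-- INSERT INTO dbo.TargetTable (Id, Payload)
-- SELECT Id, Payload FROM dbo.SourceTable
-- OPTION (MAXDOP 1);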
How often do you UPDATE the rows in this heap?
Because, unlike with a clustered index, a heap uses a RID to find specific rows. The catch is that (unless you specifically rebuild the heap) when you update a row so that it no longer fits in place, the old location stays where it was as a forwarding pointer to the new location, increasing the number of lookups needed each time you perform an update on a row.
I don't really think that is the issue here, but could you possibly add a clustered index to the table and see how the update times are affected?
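If you do try that, a minimal sketch would be something like the following (the table and key column are hypothetical placeholders):

-- Converting the heap to a clustered table also removes any existing
-- forwarding pointers; pick a key that matches how rows are located.
CREATE CLUSTERED INDEX CX_BigTable_Id
    ON dbo.BigTable (Id);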
Also, I assume you don't have a heavy trigger on the table doing a bunch of extra work as well, right?
Additionally, since you are referring to an article by Brent Ozar: he advocates breaking updates into batches of no more than 4,000 rows at a time, as that has proven to be the fastest and stays below the roughly 5,000-lock threshold at which an exclusive table lock can be taken during updates.
Description
I use Postgres together with Python 3.
There are 17 million rows in the table, and the maximum ID is over 3 billion.
My task is to run select id, link from table where data is null;, do some processing in my code, and then run update table set data = %s where id = %s for each row.
I tested that a single-row update takes about 0.1 s.
My thoughts
The following is my idea:
Try a different database; I have heard Redis is fast, but I don't know how to use it.
In addition, what is the best number of connections?
I used to open 5-6 connections.
Now I use only two connections, which works better: about 2 million rows updated in one hour.
If there is any way you can push the calculation of the new value into the database, i.e. issue a single large UPDATE statement like
UPDATE "table"
SET data = [calculation here]
WHERE data IS NULL;
you would be much faster.
But for the rest of this discussion I'll assume that you have to calculate the new values in your code, i.e. run one SELECT to get all the rows where data IS NULL and then issue a lot of UPDATE statements, each targeting a single row.
In that case, there are two ways you can speed up processing considerably:
Avoid index updates
Updating an index is more expensive than adding a tuple to the table itself (the appropriately so-called heap, onto which it is quick and easy to pile up entries). So by avoiding index updates, you will be much faster.
There are two ways to avoid index updates:
Drop all indexes after selecting the rows to change and before the UPDATEs and recreate them after processing is completed.
This will be a net win if you update enough rows (see the sketch after these two options).
Make sure that there is no index on data and that the table has been created with a fillfactor of less than 50. Then there is enough room in the data pages to write the updated row version into the same page as the original row version, which obviates the need to update the index (this is known as a HOT update).
This is probably not an option for you, since you probably didn't create the table with a fillfactor like that, but I wanted to add it for completeness' sake.
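Going back to the first option, a minimal sketch of dropping and recreating an index around the updates (the index name and definition here are hypothetical placeholders):

-- Capture the real index definition first (e.g. from \d "table" in psql)
-- so you can recreate it exactly, then drop it so the updates
-- don't have to maintain it.
DROP INDEX IF EXISTS table_link_idx;

-- ... run all the single-row UPDATE statements here ...

-- Recreate the index once processing is complete.
CREATE INDEX table_link_idx ON "table" (link);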
Bundle many updates in a single transaction
By default, each UPDATE will run in its own transaction, which is committed at the end of the statement. However, each COMMIT forces the transaction log (WAL) to be written out to disk, which slows down processing considerably.
You bundle them by explicitly issuing a BEGIN before the first UPDATE and a COMMIT after the last one. That will also make the whole operation atomic, so that all changes are undone automatically if processing is interrupted.
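For example, the stream of statements your application sends to PostgreSQL would look roughly like this (the IDs and values are placeholders, and the batch size is up to you):

BEGIN;

UPDATE "table" SET data = 'value 1' WHERE id = 101;
UPDATE "table" SET data = 'value 2' WHERE id = 102;
-- ... a few thousand more single-row updates ...

COMMIT;  -- the WAL is flushed to disk only once per batch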
Suppose I have a SQL Server table that has millions of rows and receives over 2000 inserts per minute. A separate process needs to do a bulk update on this table, let's say with a where clause that will update 1000 rows. But it doesn't care about performance and could optionally run 1000 single-row updates using the primary key.
If the bulk update runs too long, it will block the incoming insertions, right? Whereas updating rows individually will allow insertions to squeak through the cracks and not block? So from the standpoint of optimizing performance for the insertions, am I better off running the updates one row at a time?
Updates will not block the inserts, but you might get unexpected behavior if the WHERE condition of the update is not meant to apply to the newly inserted rows. So it's better to review the logic of the application to make sure that the newly inserted rows are not needed in the update.
But in general, a bulk update is much better than single-row updates.
I have to insert one record per table across 30 tables. The data comes from some other system. I have to insert the data into the tables the first time; then, whenever an update happens, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp of the individual rows and update a row only if the incoming timestamp is greater.
b) Every time, I can simply delete the records and insert the data again.
Which one will be faster in SQL Server? Is there any other option to address the situation?
If you are not changing the indexed fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If you are using SQL Server 2008+, you should be using the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
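A minimal sketch of such a MERGE, assuming hypothetical table and column names and that the incoming values arrive as parameters such as @Id, @Name, and @ModifiedAt:

-- Insert the incoming row if the key doesn't exist yet,
-- otherwise update the existing row in place.
MERGE dbo.TargetTable AS t
USING (SELECT @Id AS Id, @Name AS Name, @ModifiedAt AS ModifiedAt) AS s
    ON t.Id = s.Id
WHEN MATCHED THEN
    UPDATE SET t.Name = s.Name,
               t.ModifiedAt = s.ModifiedAt
WHEN NOT MATCHED THEN
    INSERT (Id, Name, ModifiedAt)
    VALUES (s.Id, s.Name, s.ModifiedAt);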
ADDED
I should also add that if you know the usage pattern rarely updates (i.e., 90% inserts), you may have a case where drop/insert is faster than update/insert -- it depends on lots of details. Regardless, MERGE is the clear winner if you are using 2008+.
I generally like drop and re-insert. I find it to be cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option 1.
Also, another thing to factor in is how often the timestamp check fails (i.e., you have to neither insert nor update). If 99% of the data is redundant/outdated, you're probably better off with option 1 regardless.
I'm wondering what the correct solution to the below is.
I have an UPDATE statement in T-SQL that needs to be run as a daily task. The procedure will update one bit column in one table. Rows affected is around 30,000.
A pseudo version of the T-SQL
UPDATE TABLE_NAME
SET BIT_FIELD = [dbo].[FUNCTION](TABLE_NAME.ID)
WHERE -- THIS ISN'T RELEVANT
The function that determines true or false basically runs a few checks and hits around 3 other tables. Currently the procedure takes about 30 minutes to run and update 30,000 rows in our development environment. I was expecting this to double on production.
The problem I'm having is that intermittently the TABLE_NAME table locks up. If I run the update in batches of 1,000 it seems OK, but if I increase the batch size it appears to run fine at first and then eventually the table locks up. The only resolution is to cancel the query, which results in no rows being updated.
Please note that the procedure is not wrapped in a TRANSACTION.
If I run each update in a separate UPDATE statement would this fix it? What would be a good solution when updating quite a large number of records in a live environment?
Any help would be much appreciated.
Thanks!
In your case, the SQL Server optimizer has probably determined that a table lock is needed to perform the update of your table. You should rework your query so that this table lock does not occur or has a smaller impact on your users. In practical terms this means: (a) speed up your query and (b) make sure the table does not get locked.
Personally I would consider the following:
1. Create clustered and non-clustered indexes on your tables in order to improve the performance of your query.
2. See if it is possible to avoid the function and use joins instead; they are typically a lot faster (see the sketch after this list).
3. Break up the update into multiple parts and perform these parts separately. You might have an 'or' statement in your 'where' clause; that is a good splitting point, but you can also consider creating a cursor to loop through the table and perform the update one record at a time.
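As an illustration of point 2, a set-based rewrite of the pseudo-statement above might look something like this sketch; the joined table, the condition, and the filter are hypothetical stand-ins for whatever [dbo].[FUNCTION] actually checks:

-- Compute the flag with a set-based join instead of calling a
-- scalar function once per row.
UPDATE t
SET BIT_FIELD = CASE WHEN o.ID IS NOT NULL THEN 1 ELSE 0 END
FROM TABLE_NAME AS t
LEFT JOIN OTHER_TABLE AS o        -- hypothetical table the function reads
    ON o.ID = t.ID
   AND o.SOME_CHECK = 1           -- hypothetical condition from the function
WHERE t.SOME_FILTER = 1;          -- stand-in for the original WHERE clause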