This is for SQL Server 2012. I have been asked to optimize a stored procedure that is used for calculating billing amount for various invoices.
This is a huge stored procedure with a lot of insert, update and delete queries with joins on multiple tables.
The stored procedure gets stuck on this particular update query for 8-9 hours. The query updates around 400,000 records.
At the time of stored procedure execution, there is no other connection with the database except the one running the stored procedure.
The query is
UPDATE D
SET Amount = CASE WHEN D.Container = 'PCS' AND DP.AmountPerStop = 0
                  THEN DP1.Amount * ISNULL(ISNULL(M2.MPDFactor, M.MPDFactor), 1)
                  ELSE DP1.Amount + DP.Amount * (DeliveryQty - 1) END,
    InvoiceID = I.InvoiceID,
    MPDFactor = CASE WHEN D.Container = 'PCS' AND DP.AmountPerStop = 0
                     THEN ISNULL(M2.MPDFactor, M.MPDFactor)
                     ELSE NULL END,
    PickupAmountCap = DP.PickupAmountCap
FROM dbo.tDeliveries D
JOIN tDeliveryPrices DP ON DP.ProductGroupCode = D.ProductGroupCode AND DP.Container = D.Container AND DP.Zone = D.Zone
JOIN tVendorAgreements A ON A.DIP = D.DIP AND DP.VendorAgreementID = A.VendorAgreementID
JOIN tDeliveryPrices DP1 ON DP1.ProductGroupCode = D.ProductGroupCode AND DP1.Container = D.Container AND DP1.Zone = D.Zone AND DP1.VendorAgreementID = A.VendorAgreementID
JOIN tDIP DIP ON D.Dip = DIP.Dip
JOIN tInvoices I ON A.VendorAgreementID = I.VendorAgreementID
JOIN #tDailyInvoicePeriodForDIP IP ON IP.DIP = DIP.DIP
LEFT JOIN tMPDFactors M ON M.DIPArea = DIP.DipArea AND M.ProductGroupCode = D.ProductGroupCode AND D.DeliveryQty BETWEEN M.StartQty AND M.EndQty
LEFT JOIN tMPDFactors M2 ON M2.DIP = DIP.Dip AND M2.ProductGroupCode = D.ProductGroupCode AND D.DeliveryQty BETWEEN M2.StartQty AND M2.EndQty
WHERE D.InvoiceID IS NULL
  AND I.InvoicePeriod = @Period
  AND I.InvoiceLockedDate IS NULL
  AND DP1.StartQty = 1
  AND (DeliveryQty BETWEEN DP.StartQty AND DP.EndQty OR D.Container = 'PCS')
  AND D.Event_Type = 'I'
  AND @Period BETWEEN A.ValidFromDate AND A.ValidUntilDate
As far as the end result is concerned, the query is working fine. It is just taking so much time. Any help will be very much appreciated.
Have you tried using the recompile option for the stored procedure? If your stored procedure generated a query plan against a very small dataset, that plan may force the procedure to scan very large tables, because the original plan determined that scanning was the most efficient way to gather or update the data. This may not completely resolve your issue, but it is another option. Also, the earlier suggestion is a great place to start for identifying the inefficient steps.
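As a hedged sketch (the procedure, parameter, and join below are placeholders for illustration, not the real billing procedure), recompilation can be requested either for the whole procedure or, often better, for just the expensive statement:

```sql
-- Placeholder procedure illustrating the two recompile options.
CREATE PROCEDURE dbo.CalcBillingSketch
    @Period date
WITH RECOMPILE   -- option 1: fresh plan for the whole procedure on every call
AS
BEGIN
    UPDATE D
    SET D.InvoiceID = I.InvoiceID
    FROM dbo.tDeliveries D
    JOIN dbo.tInvoices I ON I.InvoicePeriod = @Period
    WHERE D.InvoiceID IS NULL
    OPTION (RECOMPILE);  -- option 2: fresh plan for this statement only
END
```

Statement-level `OPTION (RECOMPILE)` is usually the better fit when only one statement in a large procedure is the problem, since the rest of the procedure keeps its cached plan.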
In my SQL Server 2012 database, I have a linked server reference to a second SQL Server database that I need to pull records from and update accordingly.
I have the following update statement that I am trying to run:
UPDATE Linked_Tbl
SET Transferred = 1
FROM MyLinkedServer.dbo.MyTable Linked_Tbl
JOIN MyTable Local_Tbl ON Local_Tbl.LinkedId = Linked_Tbl.Id
JOIN MyOtherTable Local_Tbl2 ON Local_Tbl.LocalId = Local_Tbl2.LocalId
I had to stop it after an hour of running, as it was still executing.
I've read online and found solutions stating that the best solution is to create a stored procedure on the Linked Server itself to execute the update statement rather than run it over the wire.
The problems I have are:
I don't have the ability to create any procedures on the other server.
Even if I could create that procedure, I would need to pass all the Ids to it for the update, and I'm not sure how to do that efficiently with thousands of Ids (though this is the smaller of the issues, since I can't create the procedure in the first place).
I'm hoping there are other solutions people may have managed to come up with given that it's often the case you don't have permissions to make changes to a different server.
Any ideas??
I am not sure whether it will give better performance, but you can try:
UPDATE Linked_Tbl
SET Transferred = 1
FROM OPENQUERY(MyLinkedServer, 'select Id, LocalId, Transferred from remotedb.dbo.MyTable') AS Linked_Tbl
JOIN MyTable Local_Tbl ON Local_Tbl.LinkedId = Linked_Tbl.Id
JOIN MyOtherTable Local_Tbl2 ON Local_Tbl.LocalId = Local_Tbl2.LocalId
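Another hedged option, if the provider allows updates through the four-part name: resolve the local join first into a temp table, so the remote update only has to match on Id rather than pull whole local tables across the wire (`#IdsToFlag` is an illustrative name):

```sql
-- Gather the Ids that need flagging using purely local joins.
SELECT Local_Tbl.LinkedId AS Id
INTO #IdsToFlag
FROM MyTable Local_Tbl
JOIN MyOtherTable Local_Tbl2 ON Local_Tbl.LocalId = Local_Tbl2.LocalId;

-- Update the remote table, joining only on the narrow Id list.
UPDATE Linked_Tbl
SET Transferred = 1
FROM MyLinkedServer.dbo.MyTable Linked_Tbl
JOIN #IdsToFlag ids ON ids.Id = Linked_Tbl.Id;
```

Whether this helps depends on how the optimizer remotes the join, so it is worth comparing the actual plans of both versions.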
I'm new to SQL Server and am doing some cleanup of our transaction database. However, to accomplish the last step, I need to update a column in one table of one database with the value from another column in another table from another database.
I found a SQL update code snippet and re-wrote it for our own needs but would love someone to give it a once over before I hit the execute button since the update will literally affect hundreds of thousands of entries.
So here are the two databases:
Database 1: Movement
Table 1: ItemMovement
Column 1: LongDescription (datatype: text / up to 40 char)
Database 2: Item
Table 2: ItemRecord
Column 2: Description (datatype: text / up to 20 char)
Goal: set Column 1 from db1 to the value of Column 2 from db2.
Here is the code snippet:
update table1
set table1.longdescription = table2.description
from movement..itemmovement as table1
inner join item..itemrecord as table2 on table1.itemcode = table2.itemcode
where table1.longdescription <> table2.description
I added the last "where" line to prevent SQL from updating the column where it already matches the source table.
This should execute faster and update only the columns that contain garbage. But as it stands, does this look like it will run? And lastly, is it a straightforward process, using SQL Server 2005 Express, to just back up the entire Movement db before I execute, and if it messes up, just restore it?
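For reference, a full backup and restore of the Movement database can be sketched as follows (the file path is illustrative; adjust it for your server):

```sql
-- Back up the Movement database before running the update.
BACKUP DATABASE Movement
TO DISK = N'C:\Backups\Movement_PreUpdate.bak'
WITH INIT;

-- If the update misbehaves, restore from that backup:
-- RESTORE DATABASE Movement
--     FROM DISK = N'C:\Backups\Movement_PreUpdate.bak'
--     WITH REPLACE;
```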
Alternatively, is it even necessary to alias the tables as table1 and table2? Is it valid to execute a SQL query like this:
update movement..itemmovement
set itemmovement.longdescription = itemrecord.description
from movement..itemmovement
inner join item..itemrecord on itemmovement.itemcode = itemrecord.itemcode
where itemmovement.longdescription <> itemrecord.description
Many thanks in advance!
You don't necessarily need to alias your tables, but I recommend you do: it speeds up typing and reduces the chance of typos.
update m
set m.longdescription = i.description
from movement..itemmovement as m
inner join item..itemrecord as i on m.itemcode = i.itemcode
where m.longdescription <> i.description
In the above query I have shortened the alias using m for itemmovement and i for itemrecord.
When a large number of records are to be updated and there's a question of whether the update will succeed, always make a copy in a test database (residing on a test server) and try it out there first. In this case, one of the safest bets is to create a new field first and call it longdescription_test. You can add it with SQL Server Management Studio Express (SSMS) or using the command below:
use movement;
alter table itemmovement add longdescription_test varchar(100);
The syntax here says to alter table itemmovement and add a new column called longdescription_test with a datatype of varchar(100) (note that SQL Server's ALTER TABLE ... ADD does not take the keyword COLUMN). If you create the new column using SSMS, it will run the same kind of alter table statement in the background.
You can then execute
update m
set m.longdescription_test = i.description
from movement..itemmovement as m
inner join item..itemrecord as i on m.itemcode = i.itemcode
where m.longdescription <> i.description
Check data in longdescription_test randomly. You can actually do a spot check faster by running:
select * from movement..itemmovement
where longdescription <> longdescription_test
and longdescription_test is not null
If information in longdescription_test looks good, you can change your update statement to set m.longdescription = i.description and run the query again.
It is easier to just create a copy of your itemmovement table before you do the update. To make a copy, you can just do:
use movement;
select * into itemmovement_backup from itemmovement;
If the update does not succeed as desired, you can truncate itemmovement and copy the data back from itemmovement_backup.
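As a hedged alternative to a full truncate-and-reload, and assuming itemcode uniquely identifies rows, the original values can be copied back in place from the backup table created above:

```sql
-- Restore the original descriptions from the backup copy.
UPDATE m
SET m.longdescription = b.longdescription
FROM movement..itemmovement AS m
JOIN movement..itemmovement_backup AS b ON m.itemcode = b.itemcode;
```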
Zedfoxus provided a GREAT explanation of this and I appreciate it. It is an excellent reference for next time around. After reading over some syntax examples, I was confident enough to run the second SQL update query from my OP. Luckily, the data here is not necessarily "live", so there was little risk of damaging anything, even during operating hours. The update executed perfectly, updating all 345,000 entries!
I'm using an OLE DB Source in SSIS to pull data rows from a SQL Server 2012 database:
SELECT item_prod.wo_id, item_prod.oper_id, item_prod.reas_cd, item_prod.lot_no, item_prod.item_id, item_prod.user_id, item_prod.seq_no, item_prod.spare1, item_prod.shift_id, item_prod.ent_id, item_prod.good_prod, item_cons.lot_no as raw_lot_no, item_cons.item_id as rm_item_id, item_cons.qty_cons
FROM item_prod
LEFT OUTER JOIN item_cons on item_cons.wo_id=item_prod.wo_id AND item_cons.oper_id=item_prod.oper_id AND item_cons.seq_no=item_prod.seq_no AND item_prod.lot_no=item_cons.fg_lot_no
This works great, and is able to pull around 1 million rows per minute currently. A left outer join is used instead of a lookup due to much better performance when using no cache, and both tables may contain upwards of 40 million rows.
We need the query to only pull rows that haven't been pulled in a previous run. The last run row_id gets stored in a variable and put at the end of the above query:
WHERE item_prod.row_id > ?
On the first run, the parameter will be -1 (to process everything). Performance drops 5-10x when the where clause is added (1 million rows per 5-10 minutes). What is causing such a significant performance drop, and is there a way to optimize it?
It turns out that SSIS prepares a parameterized statement (effectively a temporary stored procedure) when executing a query with parameters. This was discovered by looking at the execution in SQL Server Profiler.
As a result, there was a performance hit, which I believe is related to parameter sniffing.
I changed the source to use a SQL query from a variable and built the query using an expression instead, and this fixed the performance.
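For comparison, the expression-built variable ends up sending a plain statement with the literal value embedded, so no prepare/parameter step is involved. A sketch of its shape (column list abbreviated, and 12345 standing in for whatever last-run row_id was stored):

```sql
-- Shape of the statement the expression-built variable produces:
-- the filter value is a literal, not a prepared parameter.
SELECT item_prod.wo_id, item_prod.oper_id, item_prod.lot_no
FROM item_prod
WHERE item_prod.row_id > 12345;
```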
Edit: The following are the commands seen in SQL Server Profiler when executing the question's code with the where parameter:
exec [sys].sp_describe_undeclared_parameters N'SELECT item_prod.wo_id, item_prod.oper_id, item_prod.reas_cd, item_prod.lot_no, item_prod.item_id, item_prod.user_id, item_prod.seq_no, item_prod.spare1, item_prod.shift_id, item_prod.ent_id, item_prod.good_prod, item_cons.lot_no as raw_lot_no, item_cons.item_id as rm_item_id, item_cons.qty_cons
FROM item_prod
LEFT OUTER JOIN item_cons on item_cons.wo_id=item_prod.wo_id AND item_cons.oper_id=item_prod.oper_id AND item_cons.seq_no=item_prod.seq_no AND item_prod.lot_no=item_cons.fg_lot_no
WHERE item_prod.row_id > @P1'
declare @p1 int
set @p1=1
exec sp_prepare @p1 output,N'@P1 int',N'SELECT item_prod.wo_id, item_prod.oper_id, item_prod.reas_cd, item_prod.lot_no, item_prod.item_id, item_prod.user_id, item_prod.seq_no, item_prod.spare1, item_prod.shift_id, item_prod.ent_id, item_prod.good_prod, item_cons.lot_no as raw_lot_no, item_cons.item_id as rm_item_id, item_cons.qty_cons
FROM item_prod
LEFT OUTER JOIN item_cons on item_cons.wo_id=item_prod.wo_id AND item_cons.oper_id=item_prod.oper_id AND item_cons.seq_no=item_prod.seq_no AND item_prod.lot_no=item_cons.fg_lot_no
WHERE item_prod.row_id > @P1',1
select @p1
exec [sys].sp_describe_first_result_set N'SELECT item_prod.wo_id, item_prod.oper_id, item_prod.reas_cd, item_prod.lot_no, item_prod.item_id, item_prod.user_id, item_prod.seq_no, item_prod.spare1, item_prod.shift_id, item_prod.ent_id, item_prod.good_prod, item_cons.lot_no as raw_lot_no, item_cons.item_id as rm_item_id, item_cons.qty_cons
FROM item_prod
LEFT OUTER JOIN item_cons on item_cons.wo_id=item_prod.wo_id AND item_cons.oper_id=item_prod.oper_id AND item_cons.seq_no=item_prod.seq_no AND item_prod.lot_no=item_cons.fg_lot_no
WHERE item_prod.row_id > @P1',N'@P1 int',1
Since I'm not entirely sure what the generated code above does, there may be other related commands that I missed. Originally, I assumed the SSIS variables would be inserted into the query, but the introduction of the @P1 parameter led me to look at stored-procedure implications instead.
Is there an easier way of cleaning up a database that has a ton of stored procedures, some of which I'm sure aren't used anymore, other than searching for them one by one?
I'd like to search my Visual Studio solution for stored procedures and see which ones in the database are no longer used.
You could create a list of the stored procedures in the database. Store them into a file.
SELECT *
FROM sys.procedures;
Then read that file into memory and run a large regex search across the source files for each entry. Once you hit the first match for a given stored procedure, you can move on.
If there are no matches for a given stored procedure, you probably can look more closely at that stored procedure and may even be able to remove it.
I'd be careful removing stored procedures - you also need to check that no other stored procedures depend on your candidate for removal!
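For that in-database dependency check, one option is to search module definitions for the candidate's name. This is a hedged sketch ('MyCandidateProc' is a placeholder name), and a text search over sys.sql_modules misses calls made via dynamic SQL, so zero hits is a hint, not proof:

```sql
-- Find modules whose definition mentions the candidate procedure.
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name,
       OBJECT_NAME(m.object_id) AS module_name
FROM sys.sql_modules AS m
WHERE m.definition LIKE '%MyCandidateProc%'
  AND m.object_id <> OBJECT_ID('dbo.MyCandidateProc');
```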
I would use the profiler and set up a trace. The big problem is tracking SPs which are only used monthly or annually.
Anything not showing up in the trace can be investigated. I sometimes instrument individual SPs to log their invocations to a table and then review the table for activity. I've even had individual SPs instrumented to send me email when they are called.
It is relatively easy to ensure that an SP is not called from anywhere else in the server by searching INFORMATION_SCHEMA.ROUTINES, or in source code with GREP. It's a little harder to check in SSIS packages and jobs.
But none of this eliminates the possibility that there might be the occasional SP which someone calls manually each month from SSMS to correct a data anomaly or something.
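On SQL Server 2008 and later, a lighter-weight complement to a trace is the procedure-stats DMV. Note that it only covers executions since each plan entered the cache, so it shares the trace's blind spot for rarely-run procedures:

```sql
-- Procedures with cached plans in the current database,
-- most recently executed first.
SELECT OBJECT_NAME(ps.object_id) AS proc_name,
       ps.execution_count,
       ps.last_execution_time
FROM sys.dm_exec_procedure_stats AS ps
WHERE ps.database_id = DB_ID()
ORDER BY ps.last_execution_time DESC;
```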
Hmmm... you could search your solution for code which calls a stored proc, like this (from the DAAB)
using (DbCommand cmd = DB.GetStoredProcCommand("sp_blog_posts_get_by_title"))
{
DB.AddInParameter(cmd, "@title", DbType.String, title);
using (IDataReader rdr = DB.ExecuteReader(cmd))
result.Load(rdr);
}
Search for the relevant part of the first line:
DB.GetStoredProcCommand("
Copy the search results from the "find results" pane, and compare to your stored proc list in the database (which you can generate with a select from the sysObjects table if you're using SQL Server).
If you really want to get fancy, you could write a small app (or use GREP or similar) to perform a regex match against your .cs files to extract a list of stored procedures, sort the list, generate a list of stored procs from your database via select from sysobjects, and do a diff. That might be easier to automate.
UPDATE Alternatively, see this link. The author suggests setting up a trace for a period of a week or so and comparing your list of procs against those found in the trace. Another author suggested the following: (copied)
-- Unused tables & indexes. Tables have index_id’s of either 0 = Heap table or 1 = Clustered Index
SELECT OBJECTNAME = OBJECT_NAME(I.OBJECT_ID), INDEXNAME = I.NAME, I.INDEX_ID
FROM SYS.INDEXES AS I
INNER JOIN SYS.OBJECTS AS O
ON I.OBJECT_ID = O.OBJECT_ID
WHERE OBJECTPROPERTY(O.OBJECT_ID,'IsUserTable') = 1
AND I.INDEX_ID
NOT IN (SELECT S.INDEX_ID
FROM SYS.DM_DB_INDEX_USAGE_STATS AS S
WHERE S.OBJECT_ID = I.OBJECT_ID
AND I.INDEX_ID = S.INDEX_ID
AND DATABASE_ID = DB_ID(db_name()))
ORDER BY OBJECTNAME, I.INDEX_ID, INDEXNAME ASC
which should find indexes that haven't been used since the usage stats were last reset (typically the last server restart). Note that I haven't tried either of these approaches, but they seem reasonable.
A question has been boiling in my head for quite a long time: which of the following two stored procedures would perform better?
Proc 1
CREATE PROCEDURE GetEmployeeDetails @EmployeeId uniqueidentifier,
@IncludeDepartmentInfo bit
AS
BEGIN
SELECT * FROM Employees
WHERE Employees.EmployeeId = @EmployeeId
IF (@IncludeDepartmentInfo = 1)
BEGIN
SELECT Departments.* FROM Departments, Employees
WHERE Departments.DepartmentId = Employees.DepartmentId
AND Employees.EmployeeId = @EmployeeId
END
END
Proc 2
CREATE PROCEDURE GetEmployeeDetails @EmployeeId uniqueidentifier,
@IncludeDepartmentInfo bit
AS
BEGIN
SELECT * FROM Employees
WHERE Employees.EmployeeId = @EmployeeId
SELECT Departments.* FROM Departments, Employees
WHERE Departments.DepartmentId = Employees.DepartmentId
AND Employees.EmployeeId = @EmployeeId
AND @IncludeDepartmentInfo = 1
END
The only difference between the two is the use of the 'if' statement.
If proc 1 and proc 2 are called with alternating values of @IncludeDepartmentInfo, then from my understanding proc 2 would perform better, because it retains the same query plan irrespective of the value of @IncludeDepartmentInfo, whereas proc 1 changes its query plan on each call.
Answers are really appreciated.
PS: This is just a scenario; please don't focus on the specific query results but on the essence of the example. I am particularly interested in the query optimizer's behaviour (in both the 'if' and 'where' cases and their difference); there are many other aspects that I know could affect performance, which I want to keep out of this question.
SELECT Departments.* FROM Departments, Employees
WHERE Departments.DepartmentId = Employees.DepartmentId
AND Employees.EmployeeId = @EmployeeId
AND @IncludeDepartmentInfo = 1
When SQL Server compiles a query like this, it must compile it for any value of @IncludeDepartmentInfo. The resulting plan may well be one that scans the tables and performs the join, and only after that checks the variable, resulting in unnecessary I/O. The optimizer may be smart enough to move the variable check ahead of the actual I/O operations in the execution plan, but this is never guaranteed. This is why I always recommend explicit IFs in the T-SQL for queries that need to perform very differently based on a variable value (the typical example being OR conditions).
gbn's observation is also an important one: from an API design point of view, it is better to have a consistent return type (i.e. always return the same shape and number of result sets).
From a consistency perspective, number 2 will always return 2 result sets. Overloading aside, you wouldn't want a client code method that maybe returns a result, maybe not.
If you reuse this code, the other calling client will have to know this flag too.
If the code does 2 different things, then why not 2 different stored procs?
Finally, it's far better practice to use modern JOIN syntax and to separate joining from filtering. In this case, personally, I'd use EXISTS too.
SELECT
D.*
FROM
Departments D
JOIN
Employees E ON D.DepartmentId = E.DepartmentId
WHERE
E.EmployeeId = @EmployeeId
AND
@IncludeDepartmentInfo = 1
When you use the 'if' statement, you may run only one query instead of two. I would think that one query would almost always be faster than two. Your point about query plans may be valid if the first query were complex and took a long time to run, and the second was trivial. However, the first query looks like it retrieves a single row based on a primary key - probably pretty fast every time. So, I would keep the 'if' - but I would test to verify.
The performance difference would be too small for anyone to notice.
Premature optimization is the root of all evil. Stop worrying about performance and start implementing features that make your customers smile.