I need a stored procedure that updates one of my tables, which has millions of records. For simplicity's sake, let's say it only does SET LastUpdated = GETUTCDATE(). The stored procedure should be able to do the following things with the best performance possible.
Update all records (no WHERE)
Update 1 record (WHERE [Id] = @Id)
Update n records (WHERE [Id] IN (@IdCsv))
What's the best way of achieving this?
Should I create three separate stored procedures? This would make the stored procedures less manageable because I'd have to keep three stored procedures up to date instead of one. However if this gets me the best performance, I wouldn't mind having three stored procedures instead of one. But is this really the best option, performance wise? Three separate stored procedures would mean three separate query plans, right?
I could also put everything in a single stored procedure with an nvarchar parameter that contains the IDs, comma-separated. Then, combined with EXEC, I could do this:
WHERE [Id] IN (' + @IdCsv + '). I can further improve this by omitting the WHERE clause if @IdCsv is null. This solution is a lot more manageable, but does it perform well?
The last solution I could think of is using a table-valued parameter. The condition would look like this: WHERE @IdTable IS NULL OR [Id] IN (SELECT [Id] FROM @IdTable). This solution is also a lot more manageable than the first, and it also avoids the use of EXEC. However, I can't help but feel this would perform the worst, even if it is the only solution that would lead to one consistent query plan. The WHERE condition in this one is a lot more complex than in the others.
You have to choose between highly maintainable code and high performance.
Check the execution plan when you write the highly maintainable version:
DECLARE @ID INT
SET @ID = NULL
DECLARE @IdTable TABLE(ID INT)
UPDATE Test
SET LastUpdated = GETDATE()
WHERE (ID = @ID OR @ID IS NULL)
OR EXISTS
(
-- Qualify the outer column: a bare ID here would resolve to T.ID
SELECT 1 FROM @IdTable T WHERE T.ID = Test.ID
)
If you look at the execution plan, a table scan is happening on @IdTable, which accounts for 25% of the total execution cost. You can certainly remove it by using a '#' temporary table with an index on Id, but it will still be an overhead to the query.
Compare that with a high-performing query like the following:
UPDATE Test
SET LastUpdated = GETDATE()
I suggest you go with a single update; it should work fine if your ID column is indexed. SQL Server is optimized for and capable of handling huge volumes of records.
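As a rough sketch of how the table-valued parameter idea from the question could live in one procedure without the catch-all OR condition, the procedure below branches with IF so that each UPDATE stays a simple statement. The names dbo.IdList and dbo.UpdateLastUpdated are made up for illustration, dbo.Test is assumed to be the table from the example above, and table-valued parameters require SQL Server 2008 or later.

```sql
-- Hypothetical table type used to pass the list of IDs.
CREATE TYPE dbo.IdList AS TABLE (Id INT NOT NULL PRIMARY KEY);
GO

-- Hypothetical procedure name; dbo.Test and LastUpdated come from the example above.
CREATE PROCEDURE dbo.UpdateLastUpdated
    @Ids dbo.IdList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    IF NOT EXISTS (SELECT 1 FROM @Ids)
    BEGIN
        -- Assumption: an empty list means "update every row".
        UPDATE dbo.Test
        SET    LastUpdated = GETUTCDATE();
    END
    ELSE
    BEGIN
        -- One or many IDs: join against the parameter so the ID index can be used.
        UPDATE t
        SET    t.LastUpdated = GETUTCDATE()
        FROM   dbo.Test AS t
        INNER JOIN @Ids AS i ON i.Id = t.ID;
    END
END
GO
```

A single ID is just a list with one row, so the three cases collapse into two statements. Treating an empty list as "update everything" mirrors the question's idea of omitting the WHERE clause, but it is a risky default; a separate explicit flag may be safer.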
Related
I am trying to build a stored procedure that retrieves information from a few tables in my database. I often use table variables to hold data, since I have to return it in a result set and also reuse it in the following queries instead of querying the table multiple times.
Is this a good and common way to do that?
I started having performance issues when testing the stored procedure. By the way, is there an efficient way to test it without having to change the parameters each time? If I don't change the parameter values, the query takes only a few milliseconds to run; I assume it uses some sort of cache.
The performance issues appeared when everything had been working well the day before, so I reworked my queries, checked that all indexes were being used correctly, and so on. Then I tried switching the table variable for a temp table, just for testing purposes, and bingo: the next 2 or 3 tests ran like a charm, and then the performance issues started to appear again. So I am a bit clueless about what is happening here and why.
I am running my tests on the production db since it doesn't update or insert anything. Here is a piece of code to give you an idea of my test case:
-- Stuff going on to get values into a temp table for the next query
DECLARE @ApplicationIDs TABLE(ID INT)
-- This table has over 110 000 000 rows and this query uses one of its indexes. The query inserts between 1 and 10-20k rows
INSERT INTO @ApplicationIDs(ID)
SELECT ApplicationID
FROM Schema.Application
WHERE Columna = value
AND Columnb = value
AND Columnc = value
-- I query the table again, joined with other tables, to get my final result set; no performance issues here. ApplicationID is the clustered primary key
SELECT Columns
FROM Schema.Application
INNER JOIN SomeTable ON Columna = Columnb
WHERE ApplicationID IN (SELECT ID FROM @ApplicationIDs)
-- This is where the problem starts. This table has around 200 000 000 rows and about 50 columns, and yes, the ApplicationID column is indexed (nonclustered). I use this index the same way in a few other contexts and it works well, just not in this one
SELECT Columns
FROM Schema.SubApplication
WHERE ApplicationID IN (SELECT ID FROM @ApplicationIDs)
The server is in a VM with 64 gb of ram and SQL have 56GB allocated.
Let me know if you need further details.
I am currently performing analysis on a client's MSSQL Server. I've already fixed many issues (unnecessary indexes, index fragmentation, NEWID() being used for identities all over the shop etc), but I've come across a specific situation that I haven't seen before.
Process 1 imports data into a staging table, then Process 2 copies the data from the staging table using an INSERT INTO. The first process is very quick (it uses BULK INSERT), but the second takes around 30 mins to execute. The "problem" SQL in Process 2 is as follows:
INSERT INTO ProductionTable(field1,field2)
SELECT field1, field2
FROM SourceHeapTable (nolock)
The above INSERT statement inserts hundreds of thousands of records into ProductionTable, each row allocating a UNIQUEIDENTIFIER, and inserting into about 5 different indexes. I appreciate this process is going to take a long time, so my issue is this: while this import is taking place, a 3rd process is responsible for performing constant lookups on ProductionTable - in addition to inserting an additional record into the table as such:
INSERT INTO ProductionTable(fields...)
VALUES(values...)
SELECT *
FROM ProductionTable (nolock)
WHERE ID = @Id
For the 30 or so minutes that the INSERT...SELECT above is taking place, the INSERT INTO times out.
My immediate thought is that SQL Server is locking the entire table during the INSERT...SELECT. I did quite a lot of profiling on the server during my analysis, and there are definitely locks being allocated for the duration of the INSERT...SELECT, though I fail to remember what type they were.
Having never needed to insert records into a table from two sources at the same time - at least during an ETL process - I'm not sure how to approach this. I've been looking up INSERT table hints, but most are being made obsolete in future versions.
It looks to me like a CURSOR is the only way to go here?
You could consider BULK INSERT for Process-2 to get the data into the ProductionTable.
Another option would be to batch Process-2 into small batches of around 1000 records and use a Table Valued Parameter to do the INSERT. See: http://msdn.microsoft.com/en-us/library/bb510489.aspx#BulkInsert
It looks like a table lock.
Try inserting in portions in the ETL process. Something like:
while 1=1
begin
INSERT INTO ProductionTable(field1,field2)
SELECT top (1000) field1, field2
FROM SourceHeapTable sht (nolock)
where not exists (select 1 from ProductionTable pt where pt.id = sht.id)
-- check the row count right after the INSERT, before any other statement can change it
if @@rowcount = 0
break;
-- optional
--waitfor delay '00:00:01.0'
end
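To check whether the long INSERT...SELECT really holds a table-level lock, a query along these lines could be run from another session while the import is in progress. This is only a sketch: it reads the sys.dm_tran_locks view (SQL Server 2005 and later) for the current database and needs VIEW SERVER STATE permission.

```sql
-- Summarize the locks currently held or requested in this database, per session.
SELECT  request_session_id,
        resource_type,     -- e.g. OBJECT (table level), PAGE, KEY, RID
        request_mode,      -- e.g. X, IX, S
        request_status,    -- GRANT, WAIT or CONVERT
        COUNT(*) AS lock_count
FROM    sys.dm_tran_locks
WHERE   resource_database_id = DB_ID()
GROUP BY request_session_id, resource_type, request_mode, request_status
ORDER BY request_session_id, lock_count DESC;
```

A granted OBJECT lock in X mode held by the importing session, alongside WAIT rows for the other session, would be consistent with the table-lock theory. Thousands of KEY or PAGE locks with no OBJECT-level X lock would point elsewhere.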
I have a table with 40,000,000+ records, and when I try to execute a simple query, it takes ~3 min to finish. Since I am using the same query in my C# solution, which needs to execute it 100+ times, the overall performance of the solution takes a heavy hit.
This is the query that I am using in a proc
DECLARE @Id bigint
SELECT @Id = MAX(ExecutionID) from ExecutionLog where TestID=50881
select @Id
Any help to improve the performance would be great. Thanks.
What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query is filtering by TestID, so this needs to be the primary column in the composite index: if you have no indexes on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as those you'd find in the back of a big book, which are hierarchical and multi-level. If you were looking for something, you'd manually look under 'T' for TestID, and there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes, then you'll find it useful to review all the queries that hit the table, create indexes that support them, and ensure that one of those indexes is a clustered index (rather than non-clustered).
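One way to check whether the suggested index helps is to compare logical reads before and after creating it. The sketch below assumes the ExecutionLog table and the TestID value from the question; the TOP (1) form is shown only as an alternative phrasing of the same lookup.

```sql
SET STATISTICS IO ON;

-- With an index on (TestID, ExecutionID) this can be answered by a seek
-- to the end of the TestID = 50881 range instead of a full scan.
SELECT MAX(ExecutionID) AS LatestExecutionID
FROM   ExecutionLog
WHERE  TestID = 50881;

-- Alternative phrasing; note that it returns no row (rather than NULL)
-- when the TestID does not exist.
SELECT TOP (1) ExecutionID AS LatestExecutionID
FROM   ExecutionLog
WHERE  TestID = 50881
ORDER BY ExecutionID DESC;

SET STATISTICS IO OFF;
```

The "logical reads" figure reported in the Messages output should drop sharply once the index exists, and the execution plan should show an index seek rather than a scan.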
Try to re-work everything into something that works in a set based manner.
So, for instance, you could write a select statement like this:
;With OrderedLogs as (
Select ExecutionID,TestID,
ROW_NUMBER() OVER (PARTITION BY TestID ORDER By ExecutionID desc) as rn
from ExecutionLog
)
select * from OrderedLogs where rn = 1 and TestID in (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable/temp table, but hopefully, instead, you can continue building up a larger, single, query, that processes all of the results in parallel.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.
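A minimal sketch of that table-valued parameter approach follows, assuming SQL Server 2008 or later, an INT TestID column, and the dbo schema. The type name dbo.TestIdList and procedure name dbo.GetLatestExecutions are hypothetical.

```sql
-- Hypothetical table type for the list of test IDs.
CREATE TYPE dbo.TestIdList AS TABLE (TestID INT NOT NULL PRIMARY KEY);
GO

-- Hypothetical procedure: returns the latest ExecutionID for each requested test.
CREATE PROCEDURE dbo.GetLatestExecutions
    @TestIds dbo.TestIdList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    SELECT  e.TestID,
            MAX(e.ExecutionID) AS LatestExecutionID
    FROM    dbo.ExecutionLog AS e
    INNER JOIN @TestIds AS t ON t.TestID = e.TestID
    GROUP BY e.TestID;
END
GO

-- Example call from T-SQL; from C# the parameter would be sent as a structured parameter.
DECLARE @ids dbo.TestIdList;
INSERT INTO @ids (TestID) VALUES (50881), (50882), (50883);
EXEC dbo.GetLatestExecutions @TestIds = @ids;
```

This replaces 100+ round trips with one call. Tests that have no rows in ExecutionLog simply do not appear in the result.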
Background
Recently I've started to use XML a lot more as a column in SQL Server 2005. During a bit of downtime yesterday, I noticed that two of the link tables I used are really just in the way, and it bores me to tears having to write yet more supporting structure code for a couple of joins.
To actually generate the data for these two link tables, I pass in two XML fields to my stored procedure, which writes the main record, breaks the two XML variables down into #tables and inserts them into the actual tables with the new SCOPE_IDENTITY() from the master record.
After some thought, I decided to do away with those tables altogether and just store the XML in XML fields. Now I understand there are some pitfalls here, like general querying performance and the fact that GROUP BY doesn't work on XML data. The query is also generally a bit of a mess, but overall I like that I can now work with XElement when I get the data back.
Also, this stuff isn't going to get changed. It's a one shot affair, so I don't have to worry about modification.
I am wondering about the best way to actually get at this data. A lot of my queries involve getting a master record based upon the criteria of a child or even a subchild record. Most of the sprocs in the database do this but on a far more elaborate scale, usually requiring UDFs and Subqueries to work effectively but I have knocked up a trivial example to test querying some data...
INSERT INTO Customers VALUES ('Tom', '', '<PhoneNumbers><PhoneNumber Type="1" Value="01234 456789" /><PhoneNumber Type="2" Value="01746 482954" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Andy', '', '<PhoneNumbers><PhoneNumber Type="2" Value="07948 598348" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Mike', '', '<PhoneNumbers><PhoneNumber Type="3" Value="02875 482945" /></PhoneNumbers>')
INSERT INTO Customers VALUES ('Steve', '', '<PhoneNumbers></PhoneNumbers>')
Now I can see two ways of grabbing it.
Method 1
DECLARE @PhoneType INT
SET @PhoneType = 2
SELECT ct.*
FROM Customers ct
WHERE ct.PhoneNumbers.exist('/PhoneNumbers/PhoneNumber[@Type=sql:variable("@PhoneType")]') = 1
Really? sql:variable feels a bit unwholesome, but it does work. It is, however, distinctly more difficult to access the data in a more meaningful way.
Method 2
SELECT ct.*, pt.PhoneType
FROM Customers ct
CROSS APPLY ct.PhoneNumbers.nodes('/PhoneNumbers/PhoneNumber') AS nums(pn)
INNER JOIN PhoneTypes pt ON pt.ID = nums.pn.value('./@Type[1]', 'int')
WHERE nums.pn.value('./@Type[1]', 'int') = @PhoneType
This is more like it. Already I can easily expand it to do joins and all other good stuff. I've used CROSS APPLY before on a table valued function, and it was very good. The execution plan for this as opposed to the previous query is seriously more advanced. Admittedly I haven't done any indexing and whatnot on these tables, but it's 97% of the entire batch cost.
Method 2 (expanded)
SELECT ct.ID, ct.CustomerName, ct.Notes, pt.PhoneType
FROM Customers ct
CROSS APPLY ct.PhoneNumbers.nodes('/PhoneNumbers/PhoneNumber') AS nums(pn)
INNER JOIN PhoneTypes pt ON pt.ID = nums.pn.value('./@Type[1]', 'int')
WHERE nums.pn.value('./@Type[1]', 'int') IN (SELECT ID FROM PhoneTypes)
Nice IN clause here. I can also do something like pt.PhoneType = 'Work'
Finally
So I'm essentially obtaining the results that I want, but is there anything I should be aware of when using this mechanism to interrogate small amounts of XML data? Will it fall down on performance during elaborate searches? And is the storage of such markup style data too much of an overhead?
Side note
I've used things like sp_xml_preparedocument and OPENXML in the past just to pass lists into sprocs, but this is like a breath of fresh air in comparison!
One approach we've taken for some of our key items of information stored inside an XML column is to "surface" them as computed, persisted properties on the "parent" table. This is done using a little stored function.
It works great, because the value is computed only once every time the XML changes - as long as it's not changing, there's no recomputation, the value is stored on the table like any other column.
It's also great since it can be indexed! So if you're searching and/or joining on such a field - that works like a charm!
So you basically need a stored function along the lines of this:
CREATE FUNCTION [dbo].[GetPhoneNo1](@DataXML XML)
RETURNS VARCHAR(50)
WITH SCHEMABINDING
AS BEGIN
-- same length as the return type, so the value is not truncated
DECLARE @result VARCHAR(50)
SELECT
@result = @DataXML.value('(/PhoneNumbers/PhoneNumber[@Type="1"]/@Value)[1]', 'VARCHAR(50)')
RETURN @result
END
If you don't have a phone number of type 1, you'll just get back a NULL.
Then, you need to extend your parent table with a computed, persisted column:
ALTER TABLE dbo.Customers
ADD PhoneNumberType1 AS dbo.GetPhoneNo1(PhoneNumbers) PERSISTED
As you can see - it works just fine for single entries, but unfortunately, you cannot surface a whole list of properties. But if you have some key items, like ID's or something, that you expect most of your rows to have, this can be a very nice and slick way to get at that information more easily and more efficiently.
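To illustrate the indexing point, here is a sketch that assumes the PhoneNumberType1 column from above and the ID and CustomerName columns from the question. The index name is made up, and the first query only checks whether SQL Server considers the column indexable before the index is created.

```sql
-- Returns 1 if SQL Server will allow an index on the computed column.
SELECT COLUMNPROPERTY(OBJECT_ID('dbo.Customers'), 'PhoneNumberType1', 'IsIndexable') AS IsIndexable;

-- Hypothetical index name.
CREATE NONCLUSTERED INDEX IX_Customers_PhoneNumberType1
    ON dbo.Customers (PhoneNumberType1);

-- Searching on the surfaced value no longer needs to touch the XML column.
SELECT ct.ID, ct.CustomerName, ct.PhoneNumberType1
FROM   dbo.Customers AS ct
WHERE  ct.PhoneNumberType1 = '01234 456789';
```

If IsIndexable comes back as 0, the usual causes are a function that is not schema-bound or not regarded as deterministic, or session SET options that do not match the requirements for indexed computed columns.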
I have a sql query with 50 parameters, such as this one.
DECLARE
@p0 int, @p1 int, @p2 int, (text omitted), @p49 int
SELECT
@p0=111227, @p1=146599, @p2=98917, (text omitted), @p49=125319
--
SELECT
[t0].[CustomerID], [t0].[Amount],
[t0].[OrderID], [t0].[InvoiceNumber]
FROM [dbo].[Orders] AS [t0]
WHERE ([t0].[CustomerID]) IN
(@p0, @p1, @p2, (text omitted), @p49)
The estimated execution plan shows that the database will collect these parameters, order them, and then read the index Orders.CustomerID from the smallest parameter to the largest, then do a bookmark lookup for the rest of the record.
The problem is that the smallest and largest parameters could be quite far apart, and this could lead to reading the entire index.
Since this is being done in a loop from the client side (50 params sent each time, for 1000 iterations), this is a bad situation. How can I formulate the query/client side code to get my data without repetitive index scanning while keeping the number of round trips down?
I thought about ordering the 50k parameters so that smaller reads of the index would occur. There is a weird mitigating circumstance that prevents this, so I can't use that solution. To model this circumstance, just assume that I only have 50 IDs available at any time and can't control their relative position in the global list.
Insert the parameters into a table variable, then join it with your table:
DECLARE @params AS TABLE(param INT);
INSERT
INTO @params
VALUES (@p0)
...
INSERT
INTO @params
VALUES (@p49)
SELECT
[t0].[CustomerID], [t0].[Amount],
[t0].[OrderID], [t0].[InvoiceNumber]
FROM @params AS [p]
JOIN [dbo].[Orders] AS [t0]
ON [t0].[CustomerID] = [p].[param]
This will most probably use NESTED LOOPS with an INDEX SEEK over CustomerID on each loop.
An index range scan is pretty fast. There's usually a lot less data in the index than in the table and there's a much better chance that the index is already in memory.
I can't blame you for wanting to save round trips to the server by putting each of the IDs you're looking for in a bundle. If the index RANGE scan really worries you, you can create a parameterized server side cursor (e.g., in TSQL) that takes the CustomerID as a parameter. Stop as soon as you find a match. That query should definitely use an index unique scan instead of a range scan.
To build on Quassnoi's answer, if you were working with SQL 2008, you could save yourself some time by inserting all 50 items with one statement. SQL 2008 has a new feature for multiple valued inserts.
e.g.
INSERT INTO #Customers (CustID)
VALUES (@p0),
(@p1),
<snip>
(@p49)
Now #Customers table is populated and ready to INNER JOIN on, or your IN clause.