SQL Server query execution time increases exponentially - sql-server

I am currently running into some performance issues when running a query which joins multiple tables. The main table has 170 million records, so it is pretty big.
What I encounter is that when I run the query with a TOP 1000 clause, the results are instantaneous. However, when I increase that to TOP 8000 the query easily runs for 15 minutes (and then I kill it). Through trial and error I found that the tipping point is between TOP 7934 (works like a charm) and TOP 7935 (runs forever).
Does anyone recognise this behaviour and see what I am doing wrong? Maybe my query is faulty in some respect.
Thanks in advance
SELECT top 7934 h.DocIDBeg
,h.[Updated By]
,h.Action
,h.Type
,h.Details
,h.[Update Date]
,h.[Updated Field Name]
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(h.NewValue, '') As 'Value Set To'
,h.OldValue
FROM
(SELECT g.DocIDBeg
,g.[Updated By]
,g.Action
,g.Type
,g.Details
,g.[Update Date]
,CAST(g.details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,CAST(g.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') as 'value'
,CAST(g.details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)') as 'NewValue'
,CAST(g.details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM(
SELECT a.ArtifactID
,f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
) g
) h
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON h.value = i.ArtifactID

I used to work with data warehouses a lot and encountered similar problems quite often. The root cause is obviously memory usage, as was already mentioned here. I don't think rewriting your query will help much if you really need to query all 170 million records, and I don't think it is an option for you to simply wait for more memory resources.
So here is just a simple workaround from me:
Try to split your query. For example, first query all the data you need from the AuditRecord table joined to the AuditUser table and store the result in another table (a temporary table, for example). Then join this new table with the Artifact table, and so on (see the sketch below). Each of these steps will require less memory than running the whole query at once and having it hang. In the long run you will have not a single query but a script, which will be easier to track because you can print out status messages along the way, and which will actually finish, unlike the query that never ends.
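A minimal sketch of what that could look like, reusing the table and column names from the query above (the temp table names, the exact column list and the status messages are illustrative only, not a tested solution):
-- Step 1: audit records joined to users only, stored in a temp table
SELECT a.ArtifactID
,a.Action
,a.Details
,a.TimeStamp
,b.FullName
INTO #AuditWithUser
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID;
RAISERROR('step 1 done', 0, 1) WITH NOWAIT; -- print progress immediately
-- Step 2: join the intermediate result to the Document table,
-- and continue the same way for Artifact, ArtifactType and Code
SELECT t.*
,f.DocIDBeg
INTO #AuditWithDoc
FROM #AuditWithUser t
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON t.ArtifactID = f.ArtifactID;
RAISERROR('step 2 done', 0, 1) WITH NOWAIT;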
Also make sure that you really need to query all this data at once, because I can think of no use case where you would. If it is an application, you should implement paging; if it is some export functionality, then maybe there is a timestamp you can use to batch the data, for example exporting on a daily basis and querying only the data from yesterday. In that case you end up with an incremental export (sketched below).
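A rough sketch of that incremental idea, batching on the TimeStamp column from AuditRecord (it assumes a daily export of yesterday's data is what you are after):
DECLARE @yesterday date = DATEADD(DAY, -1, CAST(GETDATE() AS date));
DECLARE @today date = CAST(GETDATE() AS date);
SELECT a.ArtifactID
,a.Details
,a.TimeStamp
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
WHERE a.TimeStamp >= @yesterday
AND a.TimeStamp < @today; -- only yesterday's audit records instead of all 170 million rows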

"Through trial and error I found that the tipping point is with Top 7934 (works like a charm) and Top 7935 (Runs for ever)"
This sounds very much like a spill. Adam Machanic does a nice demo of the internals of this in the video below. Basically the TOP forces a sort, which requires a memory grant. If the memory grant is not big enough to complete the operation, some of the work gets done on disk.
https://www.youtube.com/watch?v=j5YGdIk3DXw
Go to 1:03:50 to see Adam demo a spill. In his query, 668,935 rows do not spill but 668,936 rows do and the query time more than doubles.
Watch the whole session if you have time. Very good for performance tuning!
Could also be the tipping point, as @Remus suggested, but it's all guessing without knowing the actual plan.
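If you want to check for a spill yourself, one way (a sketch, assuming you can run the query interactively and capture the actual plan) is:
SET STATISTICS XML ON; -- returns the actual execution plan alongside the results
GO
-- run the TOP 7935 version of the query here
GO
SET STATISTICS XML OFF;
-- In the returned plan, a spill warning on the Sort operator (a SpillToTempDb element
-- in the plan XML, shown as a warning triangle in SSMS) means the memory grant was
-- too small and the sort spilled to tempdb.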

I think the subselects are forcing the server to fetch everything before the filter can be applied.
This will cause more memory usage (XML fields) and make it hard to come up with a decent query plan.
As to the strange TOP behaviour: TOP has a big influence on query plan generation.
It is possible that 7935 is a cutoff point for one optimal plan and that SQL Server will choose a different path when it needs to fetch more.
Or it could come back to memory and run out of it at 7935.
Update:
I reworked your query to eliminate the nested selects. I'm not saying it is now going to be that much faster, but it eliminates some fields that weren't used, and it should be easier to understand and optimize based on the query plan.
Since we don't know the exact size of each table and we can hardly run the query to test, it is impossible to give you the best answer, but I can offer some tips:
One step would be to check whether you need all the LEFT JOINs and turn them into INNER JOINs where appropriate, e.g. AuditUser: would an AuditRecord always have a user?
Another thing you could try is to put the data of, preferably, the smaller tables into a temp table and join the bigger tables to that temp table, possibly eliminating a lot of records to join.
If possible you could denormalize a bit and, for example, put the username in AuditRecord too, so you would eliminate the join on AuditUser altogether.
But it comes down to what you need, what you can do / are allowed to do, and the data/server.
SELECT top 7934 f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
,CAST(a.Details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(CAST(a.Details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)'), '') As 'Value Set To'
,CAST(a.Details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON CAST(a.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') = i.ArtifactID

Related

Sql Server query taking too much time

The query below takes 45 seconds to execute. How can I improve the speed?
SELECT
u.i_UserID,
u.vch_LoginName,
ea.vch_EmailAddress,
u.vch_DisplayName,
'Current' As 'vch_RecordStatus'
FROM
tblUser u
LEFT JOIN [User].dbo.tblEmailAddress ea
ON u.i_EmailAddressID = ea.i_EmailAddressID
WHERE
IsNull(u.vch_LoginName, '') Like '%'
AND u.vch_DisplayName LIKE 'kala' -- 14 secs
UNION ALL -- 29 secs
SELECT
DISTINCT l.i_UserID
,l.vch_OldLoginName as vch_LoginName
,l.vch_OldEmailAddress as vch_EmailAddress
,u.vch_DisplayName as vch_DisplayName,
'Old' As 'vch_RecordStatus'
FROM tblUserStatusLog l
INNER JOIN tblUser u on u.i_UserID = l.i_UserID
WHERE
l.vch_OldLoginName Like '%'
AND u.vch_DisplayName LIKE 'kala' -- 28 secs
ORDER BY
u.vch_LoginName
First of all, why are you using LIKE to compare strings? LIKE is meant for partial string matching; you should use the = operator to search for an exact match.
I also don't understand the first condition in both of the WHERE clauses. Are you trying to check that they are not empty strings and not NULL? Better to use IS NOT NULL and the <> operator:
SELECT
u.i_UserID,
u.vch_LoginName,
ea.vch_EmailAddress,
u.vch_DisplayName,
'Current' As 'vch_RecordStatus'
FROM
tblUser u
LEFT JOIN [User].dbo.tblEmailAddress ea ON u.i_EmailAddressID = ea.i_EmailAddressID
WHERE
u.vch_LoginName is not null and u.vch_loginname <> ''
AND u.vch_DisplayName = 'kala'
UNION ALL
SELECT
DISTINCT l.i_UserID
,l.vch_OldLoginName as vch_LoginName
,l.vch_OldEmailAddress as vch_EmailAddress
,u.vch_DisplayName as vch_DisplayName,
'Old' As 'vch_RecordStatus'
FROM tblUserStatusLog l
INNER JOIN tblUser u on u.i_UserID = l.i_UserID
WHERE
l.vch_OldLoginName <> ''
AND u.vch_DisplayName = 'kala'
ORDER BY
u.vch_LoginName
Other than that, I assume you can significantly improve the performance by adding the proper indexes to the tables, but we would need your table structures to see what already exists.
You should have indexes on i_EmailAddressID on the first two tables and on i_UserID on the second pair, and an index on vch_OldLoginName and vch_DisplayName might also improve the run.
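A sketch of those indexes (the table structures weren't posted, so the index names are made up and some of these may already exist as primary or foreign keys):
CREATE INDEX IX_tblUser_EmailAddressID ON dbo.tblUser (i_EmailAddressID);
CREATE INDEX IX_tblEmailAddress_EmailAddressID ON dbo.tblEmailAddress (i_EmailAddressID); -- run this one in the [User] database
CREATE INDEX IX_tblUserStatusLog_UserID ON dbo.tblUserStatusLog (i_UserID);
CREATE INDEX IX_tblUser_DisplayName ON dbo.tblUser (vch_DisplayName);
CREATE INDEX IX_tblUserStatusLog_OldLoginName ON dbo.tblUserStatusLog (vch_OldLoginName);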
You'd want to take a look at indexing. When you run this query with the actual execution plan switched on, do you get any missing index suggestions? Make sure all indexes are optimised for this query and you'll get much better performance.
You're including the full path for tblEmailAddress; is this stored in another database called 'User'? If so, you'll want to reduce the amount of data you're pulling from the linked server. Again, an index should help you here.
You also don't want to be doing calculations in your WHERE clauses. Check out the SARGability of queries: https://en.wikipedia.org/wiki/Sargable This will also tell you why your LIKE statements are a bad idea.
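To illustrate with the query above (a sketch; it assumes the intent of the ISNULL/LIKE condition really is to filter out NULL or empty login names, as in the rewritten query earlier):
-- Non-sargable: wrapping the column in IsNull() prevents an index seek on vch_LoginName
SELECT u.i_UserID
FROM tblUser u
WHERE IsNull(u.vch_LoginName, '') LIKE '%';
-- Sargable equivalent: the bare column can use an index
SELECT u.i_UserID
FROM tblUser u
WHERE u.vch_LoginName IS NOT NULL
AND u.vch_LoginName <> '';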

Adding Conditional around query increases time by over 2400%

Update: I will get the query plan as soon as I can.
We had a poorly performing query that took 4 minutes for a particular organization. After the usual recompiling of the stored proc and updating statistics didn't help, we re-wrote the IF EXISTS(...) as a SELECT COUNT(*)..., and the stored procedure went from 4 minutes to 70 milliseconds. What is the problem with the conditional that makes a 70 ms query take 4 minutes? See the examples below.
These all take 4+ minutes:
if (
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
print 'records';
-
IF (EXISTS(
SELECT *
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)))
print 'records'
This all take 70 milliseconds:
Declare @recordCount INT;
SELECT @recordCount = COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL);
IF(@recordCount > 0)
print 'records';
It doesn't make sense to me why moving the exact same Count(*) query into an if statement causes such degradation or why 'Exists' is slower than Count. I even tried the exists() in a select CASE WHEN Exists() and it is still 4+ minutes.
Given that my previous answer was mentioned, I'll try to explain again because these things are pretty tricky. So yes, I think you're seeing the same problem as the other question. Namely a row goal issue.
So to try and explain what's causing this I'll start with the three types of joins that are at the disposal of the engine (and pretty much categorically): Loop Joins, Merge Joins, Hash Joins. Loop joins are what they sound like, a nested loop over both sets of data. Merge Joins take two sorted lists and move through them in lock-step. And Hash joins throw everything in the smaller set into a filing cabinet and then look for items in the larger set once the filing cabinet has been filled.
So performance wise, loop joins require pretty much no set up and if you're only looking for a small amount of data they're really optimal. Merge are the best of the best as far as join performance for any data size, but require data to be already sorted (which is rare). Hash Joins require a fair amount of setup but allow large data sets to be joined quickly.
Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. So the behavior you're seeing is that the optimizer thinks that rows of this query are really likely (you can confirm this by planning the query without grouping and seeing how many records it thinks it will get in the last step). In particular it probably thinks that for some table in that query, every record in that table will appear in the output.
"Eureka!" it says, "if every row in this table ends up in the output, to find if one exists I can do the really cheap start-up loop join throughout because even though it's slow for large data sets, I only need one row." But then it doesn't find that row. And doesn't find it again. And now it's iterating through a vast set of data using the least efficient means at its disposal for weeding through large sets of data.
By comparison, if you ask for the full count of data, it has to find every record by definition. It sees a vast set of data and picks the choices that are best for iterating through that entire set of data instead of just a tiny sliver of it.
If, on the other hand, it really was correct and the records were very well correlated it would have found your record with the smallest possible amount of server resources and maximized its overall throughput.
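If you want to test the row goal theory directly, one hedged option (assuming SQL Server 2016 SP1 or later for the USE HINT syntax; trace flag 4138 is the older equivalent) is to keep an existence check but tell the optimizer not to apply a row goal, then compare the plans:
DECLARE @found bit = 0;
SELECT TOP (1) @found = 1
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)
OPTION (USE HINT ('DISABLE_OPTIMIZER_ROWGOAL')); -- disables the "I only need one row" optimization
IF (@found = 1)
print 'records';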

Slow running BIDS report query...OR clauses in WHERE branch using IN sub queries

I have a report with 3 sub-reports and several queries to optimize in each. The first has several OR clauses in the WHERE branch, and the ORs filter through IN clauses that pull from sub-queries.
I say this mostly from reading this SO post. Specifically LBushkin's second point.
I'm not the greatest at TSQL but I know enough to think this is very inefficient. I think I need to do two things.
I know I need to add indexes to the tables involved.
I think the query can be greatly enhanced.
So it seems that my first step would be to improve the query. From there I can look at what columns and tables are involved and thus determine the indexes.
At this point I haven't posted table schemas as I'm looking more for options / considerations such as using a cte to replace all the IN sub-queries.
If needed I will definitely post whatever would be helpful such as physical reads etc.
SELECT DISTINCT
auth.icm_authorizationid,
auth.icm_documentnumber
FROM
Filteredicm_servicecost AS servicecost
INNER JOIN Filteredicm_authorization AS auth ON
auth.icm_authorizationid = servicecost.icm_authorizationid
INNER JOIN Filteredicm_service AS service ON
service.icm_serviceid = servicecost.icm_serviceid
INNER JOIN Filteredicm_case AS cases ON
service.icm_caseid = cases.icm_caseid
WHERE
(cases.icm_caseid IN
(SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case))
OR (service.icm_serviceid IN
(SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
OR (servicecost.icm_servicecostid IN
(SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
OR (auth.icm_authorizationid IN
(SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))
EXISTS is usually much faster than IN as the query engine is able to optimize it better.
Try this:
WHERE EXISTS (SELECT 1 FROM Filteredicm_case WHERE icm_caseid = cases.icm_caseid)
OR EXISTS (SELECT 1 FROM Filteredicm_service WHERE icm_serviceid = service.icm_serviceid)
OR EXISTS (SELECT 1 FROM Filteredicm_servicecost WHERE icm_servicecostid = servicecost.icm_servicecostid)
OR EXISTS (SELECT 1 FROM Filteredicm_authorization WHERE icm_authorizationid = auth.icm_authorizationid)
Furthermore, an index on Filteredicm_case.icm_caseid, an index on Filteredicm_service.icm_serviceid, an index on Filteredicm_servicecost.icm_servicecostid, and an index on Filteredicm_authorization.icm_authorizationid will increase performance of this query. They look like they should be keys already, however, so I suspect that indices already exist.
However, unless I'm misreading, there's no way this WHERE clause will ever evaluate to anything other than true.
The clause you wrote says WHERE cases.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). However, cases is an alias to Filteredicm_case. That's the equivalent of WHERE Filteredicm_case.icm_caseid IN (SELECT icm_caseid FROM Filteredicm_case AS CRMAF_Filteredicm_case). That will be true as long as Filteredicm_case.icm_caseid isn't NULL.
The same error in logic exists for the remaining portions in the WHERE clause:
(service.icm_serviceid IN (SELECT icm_serviceid FROM Filteredicm_service AS CRMAF_Filteredicm_service))
service is an alias for Filteredicm_service. This is always true as long as icm_serviceid is not null
(servicecost.icm_servicecostid IN (SELECT icm_servicecostid FROM Filteredicm_servicecost AS CRMAF_Filteredicm_servicecost))
servicecost is an alias for Filteredicm_servicecost. This is always true as long as icm_servicecostid is not null.
(auth.icm_authorizationid IN (SELECT icm_authorizationid FROM Filteredicm_authorization AS CRMAF_Filteredicm_authorization))
auth is an alias for Filteredicm_authorization. This is always true as long as icm_authorizationid is not null.
I don't understand what you're trying to accomplish.

Which is faster: JOIN with GROUP BY or a Subquery?

Let's say we have two tables: 'Car' and 'Part', with a joining table in 'Car_Part'. Say I want to see all cars that have a part 123 in them. I could do this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
WHERE Car_Part.Part_Id = @part_to_look_for
GROUP BY Car.Col1, Car.Col2, Car.Col3
Or I could do this
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE Car.Car_Id IN (SELECT Car_Id FROM Car_Part WHERE Part_Id = @part_to_look_for)
Now, everything in me wants to use the first method because I've been brought up by good parents who instilled in me a puritanical hatred of sub-queries and a love of set theory, but it has been suggested to me that doing that big GROUP BY is worse than a sub-query.
I should point out that we're on SQL Server 2008. I should also say that in reality I want to select based on the Part Id, Part Type and possibly other things too. So, the query I want to do actually looks like this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
WHERE (@part_Id IS NULL OR Car_Part.Part_Id = @part_Id)
AND (@part_type IS NULL OR Part.Part_Type = @part_type)
GROUP BY Car.Col1, Car.Col2, Car.Col3
Or...
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE (@part_Id IS NULL OR Car.Car_Id IN (
SELECT Car_Id
FROM Car_Part
WHERE Part_Id = @part_Id))
AND (@part_type IS NULL OR Car.Car_Id IN (
SELECT Car_Id
FROM Car_Part
INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
WHERE Part.Part_Type = @part_type))
The best thing you can do is test them yourself, on realistic data volumes. That will benefit not only this query, but all future queries where you are not sure which way is the best.
Important things to do include:
- test on production level data volumes
- test fairly & consistently (clear cache: http://www.adathedev.co.uk/2010/02/would-you-like-sql-cache-with-that.html)
- check the execution plan
You could either monitor using SQL Profiler and check the duration/reads/writes/CPU there, or SET STATISTICS IO ON; SET STATISTICS TIME ON; to output stats in SSMS. Then compare the stats for each query.
If you can't do this type of testing, you'll be potentially exposing yourself to performance problems down the line that you'll have to then tune/rectify. There are tools out there you can use that will generate data for you.
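A minimal sketch of that comparison in SSMS (it assumes a non-production server, because clearing the caches affects everything else running on the instance):
-- Clear caches so both runs start cold (do NOT do this in production)
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run query version 1 (JOIN + GROUP BY) here, then repeat the whole block for version 2 (IN subquery)
-- and compare the logical reads and CPU/elapsed times on the Messages tab
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;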
I have similar data so I checked the execution plan for both styles of query. To my surprise, the Column In Subquery (CIS) version produced an execution plan with 25% less I/O cost than the inner join (IJ) query. In the CIS execution plan I get 2 index scans of the intermediate table (Car_Part), versus an index scan of the intermediate table plus a relatively more expensive hash join in the IJ plan. My indexes are healthy but non-clustered, so it stands to reason that the index scans might be made a bit faster by clustering them. I doubt this would impact the cost of the hash join, which is the more expensive step in the IJ query.
Like the others have pointed out, it depends on your data. If you're working with many gigabytes in these 3 tables then tune away. If your rows number in the hundreds or thousands then you might be splitting hairs over a very small performance gain. I would say that the IJ query is much more readable, so as long as it's good enough, do any future developer who touches your code a favour and give them something easier to read. The row counts in my tables are 188877, 283912 and 13054, and both queries returned in less time than it took to sip coffee.
Small postscript: since you're not aggregating any numerical values, it looks like you mean SELECT DISTINCT. Unless you're actually going to do something with the groups, it's easier to see your intention with SELECT DISTINCT than with GROUP BY at the end. The IO cost is the same, but one indicates your intention better IMHO.
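For example, the join version of the query above expressed with DISTINCT instead of GROUP BY (same result, arguably clearer intent):
SELECT DISTINCT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
WHERE (@part_Id IS NULL OR Car_Part.Part_Id = @part_Id)
AND (@part_type IS NULL OR Part.Part_Type = @part_type)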
With SQL Server 2008 I would expect IN to be quicker, as it is equivalent to this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE EXISTS(SELECT * FROM Car_Part
WHERE Car_Part.Car_Id = Car.Car_Id
AND Car_Part.Part_Id = @part_to_look_for
)
i.e. it only has to check for the existence of the row, not join it and then remove duplicates. This is discussed here.

One T-SQL query output to multiple record sets

Don't ask why, but I need two tables from one SQL query.
Like this...
Select Abc, Dgf from A
and result are two tables
abc
1
1
1
dgf
2
2
2
More details?
OK, let's try.
Currently I have a stored procedure like this:
SELECT a.* from ActivityView as a with (nolock)
where a.WorkplaceGuid = @WorkplaceGuid
SELECT b.* from ActivityView as a with (nolock)
left join PersonView as b with (nolock) on a.PersonGuid=b.PersonGuid where a.WorkplaceGuid = @WorkplaceGuid
It works, but the execution time is about 22 seconds. I do this because in my program I have classes that automatically read data from record sets: class Activity and class Person. That is why I can't return it all in one record set; the program wouldn't parse it.
You can write a stored procedure that has two SELECTs.
SELECT Abc FROM A AS AbcTable;
SELECT Dgf FROM A AS DfgTable;
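A minimal sketch of wrapping the two SELECTs in one procedure (the procedure name is made up for illustration):
CREATE PROCEDURE dbo.GetAbcAndDgf
AS
BEGIN
SET NOCOUNT ON;
-- first result set
SELECT Abc FROM A;
-- second result set
SELECT Dgf FROM A;
END;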
Depending on your specific requirements, I would consider just submitting two separate queries. I don't see any advantage to combining them.
SQL Server supports the legacy COMPUTE BY clause, which acts almost like GROUP BY but returns multiple result sets (the result sets constituting each group, followed by the result sets with the aggregates):
WITH q AS
(
SELECT 1 AS id
UNION ALL
SELECT 2 AS id
)
SELECT *
FROM q
ORDER BY
id
COMPUTE COUNT(id)
BY id
This, however, is deprecated and has been removed in later releases (COMPUTE BY is gone as of SQL Server 2012), so it should not be relied upon.
Those don't seem to be excessively complicated queries (although select * should in general not be used in production and never when you are doing a join as it needlessly wastes resources sending the value of the joined field twice). Therefore if it is taking 22 seconds, then either you are returning a huge amount of data or you don't have proper indexing.
Have you looked at the execution plans to see what is causing the slowness?
