Which is faster: JOIN with GROUP BY or a Subquery? - sql-server

Let's say we have two tables, 'Car' and 'Part', with a joining table 'Car_Part'. Say I want to see all cars that have part 123 in them. I could do this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
WHERE Car_Part.Part_Id = @part_to_look_for
GROUP BY Car.Col1, Car.Col2, Car.Col3
Or I could do this
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE Car.Car_Id IN (SELECT Car_Id FROM Car_Part WHERE Part_Id = #part_to_look_for)
Now, everything in me wants to use the first method because I've been brought up by good parents who instilled in me a puritanical hatred of sub-queries and a love of set theory, but it has been suggested to me that doing that big GROUP BY is worse than a sub-query.
I should point out that we're on SQL Server 2008. I should also say that in reality I want to select based on the Part Id, Part Type and possibly other things too. So, the query I want to do actually looks like this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
WHERE (@part_Id IS NULL OR Car_Part.Part_Id = @part_Id)
AND (@part_type IS NULL OR Part.Part_Type = @part_type)
GROUP BY Car.Col1, Car.Col2, Car.Col3
Or...
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE (@part_Id IS NULL OR Car.Car_Id IN (
    SELECT Car_Id
    FROM Car_Part
    WHERE Part_Id = @part_Id))
AND (@part_type IS NULL OR Car.Car_Id IN (
    SELECT Car_Id
    FROM Car_Part
    INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
    WHERE Part.Part_Type = @part_type))

The best thing you can do is test them yourself, on realistic data volumes. That would benefit not only this query, but all future queries where you are not sure which way is best.
Important things to do include:
- test on production level data volumes
- test fairly & consistently (clear cache: http://www.adathedev.co.uk/2010/02/would-you-like-sql-cache-with-that.html)
- check the execution plan
You could either monitor using SQL Profiler and check the duration/reads/writes/CPU there, or use SET STATISTICS IO ON; SET STATISTICS TIME ON; to output stats in SSMS. Then compare the stats for each query.
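For example, a minimal test harness might look like this (just a sketch; the DBCC commands clear server-wide caches, so only run them on a test box):
DBCC FREEPROCCACHE;      -- clear cached plans (test servers only)
DBCC DROPCLEANBUFFERS;   -- clear the buffer pool (test servers only)
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run candidate query 1 here and note the reads / CPU / elapsed time,
-- then repeat the cache clear and run candidate query 2 to compare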
If you can't do this type of testing, you'll be potentially exposing yourself to performance problems down the line that you'll have to then tune/rectify. There are tools out there you can use that will generate data for you.

I have similar data so I checked the execution plan for both styles of query. To my surprise, the Column In Subquery (CIS) produced an execution plan with 25% less I/O cost than the inner join (IJ) query. In the CIS execution plan I get two index scans of the intermediate table (Car_Part), versus an index scan of the intermediate table and a relatively more expensive hash join in the IJ. My indexes are healthy but non-clustered, so it stands to reason that the index scans might be made a bit faster by clustering them. I doubt this would impact the cost of the hash join, which is the more expensive step in the IJ query.
Like the others have pointed out, it depends on your data. If you're working with many gigabytes in these 3 tables then tune away. If your rows number in the hundreds or thousands then you might be splitting hairs over a very small performance gain. I would say that the IJ query is much more readable, so as long as it's good enough, do any future developer who touches your code a favour and give them something easier to read. The row counts in my tables are 188877, 283912 and 13054, and both queries returned in less time than it took to sip coffee.
Small postscript: since you're not aggregating any numerical values, it looks like you mean to select distinct. Unless you're actually going to do something with the group, it's easier to see your intention with SELECT DISTINCT rather than a GROUP BY at the end. The IO cost is the same, but one indicates your intention better IMHO.
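In other words, the first query from the question with the GROUP BY expressed as DISTINCT:
SELECT DISTINCT Car.Col1, Car.Col2, Car.Col3
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
WHERE Car_Part.Part_Id = @part_to_look_for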

With SQL Server 2008 I would expect IN to be quicker, as it is equivalent to this:
SELECT Car.Col1, Car.Col2, Car.Col3
FROM Car
WHERE EXISTS (SELECT *
    FROM Car_Part
    WHERE Car_Part.Car_Id = Car.Car_Id
    AND Car_Part.Part_Id = @part_to_look_for
)
i.e. it only has to check for the existence of the row, not join it and then remove the duplicates. This is discussed here.

Related

SQL Server: how to do an efficient cross join

My data is structured as follows:
create table data (id int, cluster int, weight float);
insert into data values (99,1,4);
insert into data values (99,2,3);
insert into data values (99,3,4);
insert into data values (1234,2,5);
insert into data values (1234,3,2);
insert into data values (1234,4,3);
Then I have to impute some values because the vector is of a certain length x:
declare @startnum int=0
declare @endnum int=4036;
with gen as (select @startnum as num
union ALL
select num+1 from gen where num+1<=@endnum)
select * into gen from gen -- store the numbers
option(maxrecursion 10000)
I then have to cross join the values stored in gen, but this is done on two very large tables (not as in the current example); currently my query has been running for over 2 hours and I'm starting to think there is something wrong. Any ideas on how I can make this procedure faster and more correct?
Here's what I'm doing right now.
select id, cluster, max(v) as weight
from (select id, cluster, case when cluster=num then weight else 0 end as v
from data
cross join gen) cross_num
group by id, cluster;
go
EDIT: It is the last query that is running very slowly, and of course I have a super large dataset :)
Note: I also wonder what 'Paste the Plan' is exactly; I actually don't know how to look for this. Can someone give me a resource I can look up and try to understand it?
So, the problem here is that you're creating a massive Cartesian product and aggregating at the same time.
However, we might be able to cheat if your data lines up well. This may also totally backfire if it lines up poorly. I can't see your data so I don't know what's going on.
I'm going to write this using an empty staging table or temp table. You could write it as a series of CTE expressions, but in my experience those do not perform quite as well. Ideally you can take these queries and wrap them in a stored procedure.
So, the primary key for your table can't be id, cluster, because you're aggregating on that group. If id, cluster is not very selective -- meaning that there are a lot of records for each id, cluster combination -- then we might be able to significantly reduce the amount of work done here. If there are 5 records for each id, cluster then this will probably not help much, but if there are 100,000 for each id, cluster then it will probably help a lot.
First, create your gen table. I recommend creating a clustered primary key on gen.num.
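A minimal sketch of that, reusing the recursive CTE from the question to fill it (the constraint name is just illustrative):
CREATE TABLE gen (num int NOT NULL CONSTRAINT PK_gen PRIMARY KEY CLUSTERED);

DECLARE @startnum int = 0;
DECLARE @endnum int = 4036;

WITH nums AS (
    SELECT @startnum AS num
    UNION ALL
    SELECT num + 1 FROM nums WHERE num + 1 <= @endnum
)
INSERT INTO gen (num)
SELECT num FROM nums
OPTION (MAXRECURSION 10000);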
Second, let's start building the data. Remember, I'm assuming StagingTable is empty.
Here's the first query that does the real work:
INSERT INTO StagingTable (id, cluster, weight)
SELECT id, cluster, MAX(weight) AS weight
FROM data
GROUP BY id, cluster
The query would benefit from an index, but it will depend on your data whether (id, cluster, weight) or (cluster, id, weight) is better. However, before you run this you should disable any nonclustered indexes on StagingTable and then rebuild them after running at least this first insert.
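A sketch of both pieces (the index names, and the choice of (id, cluster) order, are assumptions on my part):

-- candidate index on the source table
CREATE INDEX IX_data_id_cluster ON data (id, cluster) INCLUDE (weight);

-- disable a nonclustered index on the staging table before the big insert
-- (IX_Staging_id_cluster is a hypothetical name; don't disable a clustered index,
-- since that makes the table inaccessible)
ALTER INDEX IX_Staging_id_cluster ON StagingTable DISABLE;

-- ...run the INSERT above, then rebuild it
ALTER INDEX IX_Staging_id_cluster ON StagingTable REBUILD;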
Depending on your data, you may need, benefit from, or want to avoid a WHERE cluster BETWEEN 0 AND 4036 clause on the above query as well. It's not clear to me if there are 4037 clusters numbered 0 to 4036, or if you're only interested in clusters 0 to 4036 but there are more, or if you're only interested in creating "default" records of weight 0 for clusters 0 to 4036 but want all clusters aggregated if they happen to go higher.
Now, think about what's in StagingTable. Every row we've loaded into that table corresponds to an id, cluster combination that exists in the data table. Critically, every id we might need will be in StagingTable, even if it's missing one or more values for cluster.
Now we just need to fill in the missing cluster values for each id, and we know that the weight of the missing clusters is 0.
INSERT INTO StagingTable (id, cluster, weight)
SELECT DISTINCT s.id, g.num, 0
FROM StagingTable s
INNER JOIN gen g
ON g.num BETWEEN 0 AND 4036
WHERE NOT EXISTS (
SELECT 1
FROM StagingTable s2
WHERE s2.id = s.id
AND s2.cluster = g.num
)
The INNER JOIN gen g ON g.num BETWEEN 0 AND 4036 may not be necessary if gen is always going to be numbers 0 to 4036. In that case you can just use CROSS JOIN gen g.
The NOT EXISTS is necessary so that we don't insert rows for id, cluster combinations that are already in StagingTable.
Again, this query could benefit from an index on StagingTable, but without seeing your actual data it's a little difficult to tell exactly what you need (id, cluster) is one possibility, but (cluster, id) may actually work better. Ideally, it should be a clustered primary key.
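If you go the clustered primary key route, that is simply (the constraint name is illustrative):

ALTER TABLE StagingTable
ADD CONSTRAINT PK_StagingTable PRIMARY KEY CLUSTERED (id, cluster);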
Edit: Just realized my original second query wouldn't work in some cases. I've modified it to correct the logic.

Adding Conditional around query increases time by over 2400%

Update: I will get the query plan as soon as I can.
We had a poorly performing query that took 4 minutes for a particular organization. After the usual recompiling of the stored proc and updating statistics didn't help, we re-wrote the IF EXISTS(...) as a SELECT COUNT(*)... and the stored procedure went from 4 minutes to 70 milliseconds. What is the problem with the conditional that makes a 70 ms query take 4 minutes? See the examples:
These all take 4+ minutes:
if (
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
print 'records';
-
IF (EXISTS(
SELECT *
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)))
print 'records'
This takes 70 milliseconds:
Declare @recordCount INT;
SELECT @recordCount = COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL);
IF(@recordCount > 0)
print 'records';
It doesn't make sense to me why moving the exact same COUNT(*) query into an IF statement causes such degradation, or why EXISTS is slower than COUNT. I even tried the EXISTS() in a SELECT CASE WHEN EXISTS() and it is still 4+ minutes.
Given that my previous answer was mentioned, I'll try to explain again because these things are pretty tricky. So yes, I think you're seeing the same problem as in the other question, namely a row goal issue.
So to try and explain what's causing this I'll start with the three types of joins that are at the disposal of the engine (and pretty much categorically): Loop Joins, Merge Joins, Hash Joins. Loop joins are what they sound like, a nested loop over both sets of data. Merge Joins take two sorted lists and move through them in lock-step. And Hash joins throw everything in the smaller set into a filing cabinet and then look for items in the larger set once the filing cabinet has been filled.
So performance-wise, loop joins require pretty much no set-up, and if you're only looking for a small amount of data they're really optimal. Merge joins are the best of the best as far as join performance for any data size, but they require the data to already be sorted (which is rare). Hash joins require a fair amount of set-up but allow large data sets to be joined quickly.
Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. So the behavior you're seeing is that the optimizer thinks that rows of this query are really likely (you can confirm this by planning the query without grouping and seeing how many records it thinks it will get in the last step). In particular it probably thinks that for some table in that query, every record in that table will appear in the output.
"Eureka!" it says, "if every row in this table ends up in the output, to find if one exists I can do the really cheap start-up loop join throughout because even though it's slow for large data sets, I only need one row." But then it doesn't find that row. And doesn't find it again. And now it's iterating through a vast set of data using the least efficient means at its disposal for weeding through large sets of data.
By comparison, if you ask for the full count of data, it has to find every record by definition. It sees a vast set of data and picks the choices that are best for iterating through that entire set of data instead of just a tiny sliver of it.
If, on the other hand, it really was correct and the records were very well correlated it would have found your record with the smallest possible amount of server resources and maximized its overall throughput.
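If you want to attack the row goal directly rather than switching to the COUNT(*) workaround, one option on SQL Server 2016 SP1 and later (whether that applies to your environment is an assumption on my part) is the DISABLE_OPTIMIZER_ROWGOAL hint, applied to an existence check written as a variable assignment:
DECLARE @found bit = 0;

SELECT TOP (1) @found = 1
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));  -- ask the optimizer not to plan for "just one row"

IF (@found = 1)
    print 'records';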

How can I optimize this query? I've to use two INNER JOIN for the same table (SQL Server)

Is there any way I can avoid doing two INNER JOINs on the same table in this case?
SELECT B.CostCatCd As CostCatCd,
F.CountryDesc AS SenderCountry,
B.SenderCompanycd AS SenderCompanyCd,
D.CountryDesc As ReceivingCountry,
B.BillCompanycd AS ReceivingCompanyCd,
SUM(B.BillAmt) as Amount
FROM Bill B
INNER JOIN Company C
ON B.FY = C.FY
AND B.CycleCd = C.CycleCd
AND B.BillCompanyCd = C.CompanyCd
INNER JOIN Country D
ON B.FY = D.FY
AND B.CycleCd = D.CycleCd
AND C.CountryCd = D.CountryCd
INNER JOIN Company E
ON B.FY = E.FY
AND B.CycleCd = E.CycleCd
AND B.SenderCompanyCd = E.CompanyCd
INNER JOIN Country F
ON B.FY = F.FY
AND B.CycleCd = F.CycleCd
AND E.CountryCd = F.CountryCd
I'm trying to improve the performance of an SP and maybe this is something that could be improved. I have the same concern for both tables (Company & Country).
Thanks in advance!
Without the details it's not so simple to give suggestions, but you should look into the actual query plan and the STATISTICS IO output. Those give quite a good idea of what's going on with your SQL.
If the query is running slow, you should check the following things:
The table with the biggest logical reads
Scans in the query plan
Key lookups in the query plan when they happen for a large number of rows
Spools, sorts, spills into tempdb
For indexing, it looks like good candidates would be:
Company: CompanyCd, CycleCd, FY (+ CountryCd as included column)
Country: CountryCd, CycleCd, FY (+ CountryDesc as included column)
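Spelled out as CREATE INDEX statements, that might look like this (the index names are illustrative; the column order follows the selectivity guess explained below):

CREATE INDEX IX_Company_CompanyCd_CycleCd_FY
ON Company (CompanyCd, CycleCd, FY)
INCLUDE (CountryCd);

CREATE INDEX IX_Country_CountryCd_CycleCd_FY
ON Country (CountryCd, CycleCd, FY)
INCLUDE (CountryDesc);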
Everything of course depends on how often the rows are being updated, since indexes will slow those updates down (slightly), but I'm guessing that companies and countries don't get many updates. I made a guess about the selectivity of the columns, and that's why the columns in the indexes are in that order.
Indexing Bill properly is a good idea too, but since the WHERE clause is missing it's not possible to give any suggestions.

SQL Server execution time of query increases exponentially

I am currently running into some performance issues when running a query which joins multiple tables. The main table has 170 million records, so it is pretty big.
What I encounter is that when I run the query with a TOP 1000 clause, the results are instantaneous. However, when I increase that to TOP 8000 the query easily runs for 15 minutes (and then I kill it). Through trial and error I found that the tipping point is between TOP 7934 (works like a charm) and TOP 7935 (runs forever).
Does someone recognise this behaviour and see what I am doing wrong? Maybe my query is faulty in some respect.
Thanks in advance
SELECT top 7934 h.DocIDBeg
,h.[Updated By]
,h.Action
,h.Type
,h.Details
,h.[Update Date]
,h.[Updated Field Name]
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(h.NewValue, '') As 'Value Set To'
,h.OldValue
FROM
(SELECT g.DocIDBeg
,g.[Updated By]
,g.Action
,g.Type
,g.Details
,g.[Update Date]
,CAST(g.details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,CAST(g.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') as 'value'
,CAST(g.details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)') as 'NewValue'
,CAST(g.details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM(
SELECT a.ArtifactID
,f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
) g
) h
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON h.value = i.ArtifactID
I used to work with data warehouses a lot and encountered similar problems quite often. The root cause is obviously memory usage, as was already mentioned here. I don't think that rewriting your query will help a lot if you really need to query all 170 million records, and I don't think it is OK for you to just wait for more memory resources.
So here is just a simple workaround from me:
Try to split your query. For example, first query all the data you need from the AuditRecord table joined to the AuditUser table and store the result in another table (a temporary table, for example). Then join this new table with Artifact, and so on. This way each step will require less memory than running the whole query and having it hang. In the long run you will have not a query but a script, which will be easier to track because you can print out statuses along the way, and which will do its job, unlike the query that never ends.
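A rough sketch of that approach, using the table names from the query (the temp table names are just illustrative, and you would carry the TOP and the XML parsing through in the same way):

SELECT a.ArtifactID, a.Details, a.TimeStamp, a.Action, b.FullName
INTO #audit_user
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID;

SELECT au.*, c.Action AS ActionName
INTO #audit_action
FROM #audit_user au
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON au.Action = c.AuditActionID;

-- ...continue the same way for Artifact, ArtifactType, Document and Code,
-- each step joining the previous temp table to one more table.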
Also make sure that you really need to query all this data at once, because I can think of no use case where you would. If it is an application then you should implement paging; if it is some export functionality then maybe there is a timeline you can use to batch the data, for example exporting on a daily basis and querying only the data from yesterday. In that case you will end up with an incremental export.
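For the daily-batch idea, the cut-off might look something like this (assuming AuditRecord.TimeStamp is the right column to filter on):

DECLARE @today date = CAST(GETDATE() AS date);
DECLARE @yesterday date = DATEADD(DAY, -1, @today);

-- added to the WHERE clause of the innermost query:
-- WHERE a.TimeStamp >= @yesterday AND a.TimeStamp < @today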
"Through trial and error I found that the tipping point is with Top 7934 (works like a charm) and Top 7935 (Runs for ever)"
This sounds very much like a spill. Adam Machanic does a nice demo of the internals of this in the video below. Basically the TOP forces a sort, which requires memory. If the memory grant is not big enough to complete the operation, some of it gets done on disk.
https://www.youtube.com/watch?v=j5YGdIk3DXw
Go to 1:03:50 to see Adam demo a spill. In his query, 668,935 rows do not spill but 668,936 rows do and the query time more than doubles.
Watch the whole session if you have time. Very good for performance tuning!
Could also be the tipping point, as @Remus suggested, but it's all guessing without knowing the actual plan.
I think the subselects are forcing the server to fetch everything before the filter can be applied.
This will cause more memory usage (XML fields) and make it hard to use a decent query plan.
As to the strange TOP behaviour: TOP has a big influence on query plan generation.
It is possible that 7935 is a cutoff point for one optimal plan and that SQL Server will choose a different path when it needs to fetch more.
Or it could come back to memory and run out of it at 7935.
Update:
I reworked your query to eliminate the nested selects. I'm not saying it's now going to be that much faster, but it eliminates some fields that weren't used and it should be easier to understand and optimize based on the query plan.
Since we don't know the exact size of each table and we can hardly run the query to test, it's impossible to give you the best answer, but I can try some tips:
One step would be to check if you need all the LEFT JOINs and turn them into INNER JOINs where that's valid, e.g. AuditUser: would an AuditRecord always have a user?
Another thing you could try is to put the data of (preferably) the smaller tables in a temp table and join the bigger tables to that temp table, possibly eliminating a lot of records to join.
If possible you could denormalize a bit and, for example, put the username in AuditRecord too, so you would eliminate the join on AuditUser altogether.
But it is up to what you need, what you can/are allowed to do, and the data/server.
SELECT top 7934 f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
,CAST(a.Details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(CAST(a.Details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)'), '') As 'Value Set To'
,CAST(a.Details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON CAST(a.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') = i.ArtifactID

One T-SQL query output to multiple record sets

Don't ask what for, but I need two tables from one SQL query.
Like this...
Select Abc, Dgf from A
and the result is two tables:
abc
1
1
1
dgf
2
2
2
More details?
OK, let's try.
Now I have an SP like this:
SELECT a.* from ActivityView as a with (nolock)
where a.WorkplaceGuid = @WorkplaceGuid
SELECT b.* from ActivityView as a with (nolock)
left join PersonView as b with (nolock) on a.PersonGuid=b.PersonGuid where a.WorkplaceGuid = @WorkplaceGuid
It's cool, but the execution time is about 22 seconds. I do this because in my program I have classes that automatically get data from record sets: class Activity and class Person. That's why I can't do it in one recordset; the program can't parse it.
You can write a stored procedure that has two SELECTs.
SELECT Abc FROM A AS AbcTable;
SELECT Dgf FROM A AS DfgTable;
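Wrapped in a stored procedure (the name is just illustrative), that would be:

CREATE PROCEDURE dbo.GetAbcAndDgf
AS
BEGIN
    SET NOCOUNT ON;
    SELECT Abc FROM A;  -- first result set
    SELECT Dgf FROM A;  -- second result set
END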
Depending on your specific requirements, I would consider just submitting two separate queries. I don't see any advantage to combining them.
SQL Server supports the legacy COMPUTE BY clause, which acts almost like GROUP BY but returns multiple result sets (the result sets constituting each group, followed by the result sets with the aggregates):
WITH q AS
(
SELECT 1 AS id
UNION ALL
SELECT 2 AS id
)
SELECT *
FROM q
ORDER BY
id
COMPUTE COUNT(id)
BY id
This, however, is obsolete and is to be removed in future releases.
Those don't seem to be excessively complicated queries (although select * should in general not be used in production and never when you are doing a join as it needlessly wastes resources sending the value of the joined field twice). Therefore if it is taking 22 seconds, then either you are returning a huge amount of data or you don't have proper indexing.
Have you looked at the execution plans to see what is causing the slowness?
