SQL Server IN vs. EXISTS Performance - sql-server

I'm curious which of the following would be more efficient.
I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?
WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
vs.
WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)

EXISTS will be faster because once the engine has found a hit, it will quit looking as the condition has proved true.
With IN, it will collect all the results from the sub-query before further processing.

The accepted answer is shortsighted and the question a bit loose in that:
1) Neither explicitly mentions whether a covering index is present on the left side, the right side, or both.
2) Neither takes into account the size of the left-side input set and the right-side input set.
(The question just mentions an overall large result set.)
I believe the optimizer is smart enough to convert between "in" and "exists" when there is a significant cost difference due to (1) and (2); otherwise either may just act as a hint (e.g. exists to encourage use of a seekable index on the right side).
Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash or merge joins, based on the estimated row counts (left and right) and index existence on the left, right, or both sides.
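For example, to see what the optimizer actually chose for a given schema, you can capture the estimated plans for both forms from the question (the outer table name Boxes is an assumption; the question only shows the alias bx):

-- Estimated plans for both forms; SHOWPLAN_XML must be the only statement in its batch.
SET SHOWPLAN_XML ON;
GO

SELECT bx.BoxID
FROM Boxes bx
WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2);

SELECT bx.BoxID
FROM Boxes bx
WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2);
GO

SET SHOWPLAN_XML OFF;
GO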

I've done some testing on SQL Server 2005 and 2008, and on both, the EXISTS and the IN come back with the exact same actual execution plan, as others have stated. The Optimizer is optimal. :)
Something to be aware of though, EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
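One common case where they diverge (a minimal sketch with hypothetical Orders/Customers tables): when the subquery's column can contain NULLs, NOT IN and NOT EXISTS return different results.

-- With a NULL in the subquery's column, NOT IN returns no rows at all,
-- while NOT EXISTS behaves as most people expect.
SELECT o.OrderID
FROM Orders o
WHERE o.CustomerID NOT IN (SELECT c.CustomerID FROM Customers c);   -- empty if any c.CustomerID is NULL

SELECT o.OrderID
FROM Orders o
WHERE NOT EXISTS (SELECT * FROM Customers c WHERE c.CustomerID = o.CustomerID);   -- unaffected by NULLs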

I'd go with EXISTS over IN; see the link below:
SQL Server: JOIN vs IN vs EXISTS - the logical difference
There is a common misconception that IN behaves equally to EXISTS or JOIN in terms of returned results. This is simply not true.
IN: Returns true if a specified value matches any value in a subquery or a list.
Exists: Returns true if a subquery contains any rows.
Join: Joins 2 resultsets on the joining column.
Blog credit: https://stackoverflow.com/users/31345/mladen-prajdic
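As a quick illustration of those three definitions (hypothetical Parent/Child tables, where one Parent has two matching Child rows):

-- EXISTS and IN return each Parent row at most once; the JOIN repeats it per matching Child row.
SELECT p.ID FROM Parent p
WHERE EXISTS (SELECT * FROM Child c WHERE c.ParentID = p.ID);   -- one row per matching parent

SELECT p.ID FROM Parent p
WHERE p.ID IN (SELECT c.ParentID FROM Child c);                 -- one row per matching parent

SELECT p.ID FROM Parent p
JOIN Child c ON c.ParentID = p.ID;                              -- one row per matching child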

There are many misleading answers here, including the highly upvoted one (although I don't believe their authors meant harm). The short answer is: these are the same.
There are many keywords in the (T-)SQL language, but in the end, the only thing that really happens on the hardware is the operations as seen in the query execution plan.
The relational (maths theory) operation we do when we invoke [NOT] IN and [NOT] EXISTS is the semi join (anti-join when using NOT). It is not a coincidence that the corresponding sql-server operations have the same name. There is no operation that mentions IN or EXISTS anywhere - only (anti-)semi joins. Thus, there is no way that a logically-equivalent IN vs EXISTS choice could affect performance, because there is one and only one way, the (anti-)semi join execution operation, to get their results.
An example:
Query 1 ( plan )
select * from dt where dt.customer in (select c.code from customer c where c.active=0)
Query 2 ( plan )
select * from dt where exists (select 1 from customer c where c.code=dt.customer and c.active=0)

The execution plans are typically going to be identical in these cases, but until you see how the optimizer factors in all the other aspects of indexes etc., you really will never know.

So, IN is not the same as EXISTS, nor will it produce the same execution plan.
Usually EXISTS is used in a correlated subquery; that means the EXISTS inner query is joined with your outer query. That adds more steps to produce a result, as you need to resolve the outer query joins and the inner query joins and then match their WHERE clauses to join both.
Usually IN is used without correlating the inner query with the outer query, and that can be solved in a single step (in the best-case scenario).
Consider this:
If you use IN and the inner query result is millions of rows of distinct values, it will probably perform SLOWER than EXISTS, provided the EXISTS query is performant (has the right indexes to join with the outer query).
If you use EXISTS and the join with your outer query is complex (takes more time to perform, no suitable indexes), it will slow the query down in proportion to the number of rows in the outer table; sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your hardware, or the cardinality of the data is right (for example, fewer DISTINCT values in a large data set), IN can perform faster than EXISTS.
All of the above becomes noticeable when you have a fair number of rows in each table (by fair I mean something that exceeds your CPU and/or RAM thresholds for caching).
So the ANSWER is: it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values and EXISTS when you have a lot of rows with a lot of distinct values.
The trick is to limit the number of rows to be scanned.
Regards,
MarianoC

To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.
So:
WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
Because the correlated sub-query is RBAR, the first result hit makes the condition true, and it is processed no further.

I know that this is a very old question but I think my answer would add some tips.
I just came across a blog post on MSSQLTips, SQL EXISTS vs IN vs JOIN, and it turns out that they generally perform the same.
But the downsides of one versus the other are as follows:
The IN statement has the downside that it can only compare the two tables on one column (illustrated below).
A JOIN returns a row for every duplicate match, while IN and EXISTS return each outer row at most once.
But when you look at the execution time there is no big difference.
The interesting thing is that when you create an index on the table, the execution of the JOIN is better.
And I think that JOIN has another upside: it's easier to write and understand, especially for newcomers.
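For example, the single-column limitation mentioned above: IN can only correlate on one expression, whereas EXISTS can match on several columns at once (a minimal sketch with hypothetical Orders/Returns tables):

-- IN is limited to a single comparison expression ...
SELECT o.*
FROM Orders o
WHERE o.CustomerID IN (SELECT r.CustomerID FROM Returns r);

-- ... while EXISTS can correlate on as many columns as needed.
SELECT o.*
FROM Orders o
WHERE EXISTS (SELECT *
              FROM Returns r
              WHERE r.CustomerID = o.CustomerID
                AND r.ProductID  = o.ProductID);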

Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.
In the first, the correlated subquery will likely be run once for each row.
In the second example, the subquery should only run once, since it is not correlated.
In the second example, the IN will short-circuit as soon as it finds a match.

Related

TSQL - What is the fastest way to check for more than one record?

Sometimes I need to check if at least one record is present; usually I use:
IF EXISTS (SELECT TOP 1 1 FROM [SomeTable] WHERE [Fields] = [Values]) BEGIN
-- action
END
Is there a fast way to check if more than one record is present? I could do something like:
IF EXISTS (SELECT 1 FROM [SomeTable]
WHERE [Fields] = [Values]
HAVING Count(*) > 1)
BEGIN
-- action
END
But I'm not sure if it is the fastest way of doing this as it will test all the records in the set. Is there a faster way?
The 'where' part can be quite complex and could consist of multiple ANDs and ORs.
SQL Server does not generally short-circuit aggregate queries. Sometimes it can transform a HAVING COUNT(*) > 0 query to use the same plan as EXISTS (discussed in the comments here) but that's as far as it goes.
A HAVING COUNT(*) > 1 query will always count all the rows, even though in theory it could stop counting after row 2.
With that in mind I would use
IF EXISTS (
    SELECT COUNT(*) FROM (
        SELECT TOP (2) *
        FROM [SomeTable]
        WHERE [Fields] = [Values]
    ) T
    HAVING COUNT(*) = 2)
The TOP 2 iterator will stop requesting rows after the second one is returned and thus allow the inner query to short-circuit early rather than returning all the rows and counting them.
Example plans for both versions are below
Regarding the question in the comments about
"How can you tell which one is best? Is it the query cost?"
In the particular case shown in the plans above cost would be a reasonable indication as the estimated and actual row counts are quite accurate and the two plans are very similar except for the addition of the TOP iterator. So the additional cost shown in the plan is entirely a representation of the fact that additional number of rows need to be scanned (and possibly read in from disc) and counted.
It is quite clear cut in this case that this just represents additional work. In other plans it may not be. The addition of the TOP 2 may change the query tree underneath it significantly (e.g. disfavouring plans with blocking iterators)
In that case the cost shown in execution plans may not be a reliable metric. Even in actual execution plans the cost shown is based on estimates, so it is only as good as those are, and even if the estimated row counts are good the costs shown are still just based on certain modelling assumptions.
SQL Kiwi puts it well in this recent answer on the DBA site
optimizer cost estimates are mainly only useful for internal server
purposes. They are not intended to be used to assess potential
performance, even at a 'high level'. The model is an abstraction that
happens to work reasonably well for the internal purposes it was
designed for. The chances that estimated costs bear any sensible
resemblance to real execution costs on your hardware and configuration
is very small indeed.
Choose other metrics to compare performance, based on whatever real
issues are important to you.
logical reads (shown when SET STATISTICS IO ON;) are one such metric that can be looked at but again focusing on this exclusively can be misleading. Testing query duration is probably the only reliable way but even that is not an exact science as performance can vary dependent upon concurrent activity on the server (waits for memory grants, DOP available, number of relevant pages in the cache).
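For example, a minimal way to capture logical reads and timings when comparing the two variants (run each candidate query between the SET statements and read the figures on the Messages tab):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the candidate query here, e.g. the EXISTS(SELECT TOP 2 ...) version

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;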
In the end it just comes down to getting a query plan that looks to be an efficient use of the resources on your server.
I'm sure there are tricks that'll enable you to perform this check faster - although it'll depend very much upon your schema (especially indexes), and a particular check may work for one situation and not for another.
Something like the below might work for you.
IF EXISTS (SELECT * FROM [SomeTable] T1
INNER JOIN [SomeTable] T2
ON T1.UniqueID <> T2.UniqueID
WHERE T1.[Fields] = T1.[Values]
AND T2.[Fields] = T2.[Values])
BEGIN
-- action
END
Don't bother with top or select 1.
if exists (select * ...)
is just as fast.
I got excellent performance with this solution in a table with 19 million records:
IF EXISTS (
SELECT '1' FROM (
SELECT TOP(2) '1' AS 'N'
FROM TBL_KV3) AS Z
HAVING COUNT(*) > 1
)
SELECT '1'
ELSE
SELECT '0'
Not sure about the performance, but you could use a CTE and COUNT(*) OVER():
WITH Match AS
(
SELECT t1.*, COUNT(*) OVER (PARTITION BY t1.Fields) AS CountFields
FROM SomeTable t1
WHERE t1.Fields = @Values
)
SELECT m1.*
FROM Match m1
WHERE CountFields >= 2

What is the reason for performance difference?

I am creating a report, and for that I have written two different types of queries. But I am seeing a huge performance difference between these two methods. What may be the reason? My main table (call it table A) contains a date column, and I am filtering the data based on that date. I have to join around 10 tables with this table.
First method:
select A.id, A1.name, ...
from A
join A1
join A2 .... A10
where A.Cdate >= @date
and A.Cdate <= @date
Second method:
With CTE as (select A.id from A where A.Cdate >= @date and A.Cdate <= @date)
select CTE.id, A1.name, .... from CTE join A1 join A2....A10
Here the second method is faster. What is the reason? In the first method, only the filtered data of A should be joined with the other tables' data, right?
Execution plans will tell us for sure, but likely if the CTE is able to filter out a lot of rows before the join, a more optimal join approach may have been chosen (for example merge instead of hash). SQL isn't supposed to work that way - in theory those two should execute the same way. But in practice we find that SQL Server's optimizer isn't perfect and can be swayed in different ways based on a variety of factors, including statistics, selectivity, parallelism, a pre-existing version of that plan in the cache, etc.
One suggestion: you should be able to answer this question yourself, as you have both plans. Did you compare the two plans? Are they similar? Also, when you say performance is bad, what do you mean: elapsed time, CPU time, I/O, or what exactly did you compare?
Before you post a question like this you should check these counters; I am sure they will provide some kind of answer in most cases.
A CTE is for managing the code; it won't improve the performance of a query automatically. CTEs are expanded by the optimizer, so in your case both versions should produce the same query after transformation or expansion, and thus similar plans.
A CTE is basically for handling more complex code (be it recursive or containing a subquery), but you need to check the execution plans to see whether either of the two queries is actually an improvement.
You can check for the use of CTE at : http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
Just a note: in many scenarios temp tables give better performance than CTEs, so you should give temp tables a try as well.
Reference : http://social.msdn.microsoft.com/Forums/en/transactsql/thread/d040d19d-016e-4a21-bf44-a0359fb3c7fb
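A rough sketch of the temp-table variant of the query from the question (join conditions are placeholders, as in the original):

-- Materialize the date filter first so the optimizer has an exact row count
-- for the small set before planning the remaining joins.
SELECT A.id
INTO #Filtered
FROM A
WHERE A.Cdate >= @date
  AND A.Cdate <= @date;

SELECT f.id, A1.name  -- , ... remaining columns as in the original query
FROM #Filtered AS f
JOIN A1 ON A1.id = f.id;  -- hypothetical join condition; repeat for A2 ... A10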

Why use JOIN rather than inner queries

I find myself unwilling to push to using JOIN when I can easily solve the same problem by using an inner query:
e.g.
SELECT COLUMN1, ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2 FROM TABLE1;
My question is, is this a bad programming practice? I find it easier to read and maintain as opposed to a join.
UPDATE
I want to add that there's some great feedback in here which, in essence, is pushing me back to using JOIN. I find myself less and less involved with using T-SQL directly these days as a result of ORM solutions (LINQ to SQL, NHibernate, etc.), but when I do, it's things like correlated subqueries which I find easier to type out linearly.
Personally, I find this incredibly difficult to read. It isn't the structure a SQL developer expects. By using JOIN, you are keeping all of your table sources in a single spot instead of spreading it throughout your query.
What happens if you need to have three or four joins? Putting all of those into the SELECT clause is going to get hairy.
A join is usually faster than a correlated subquery as it acts on the set of rows rather than one row at a time. I would never let this code go to my production server.
And I find a join much much easier to read and maintain.
If you needed more than one column from the second table, then you would require two subqueries. This typically would not perform as well as a join.
This is not equivalent to JOIN.
If you have multiple rows in TABLE2 for each row in TABLE1, you won't get them.
For each row in TABLE1 you get at most one row of output, so you can't get multiple rows from TABLE2 (in SQL Server the query actually fails at runtime if the scalar subquery returns more than one row).
This is why I'd use "JOIN": to make sure I get the data I wanted...
After your update: I rarely use correlation except with EXISTS...
The query you use was often used as a replacement for a LEFT JOIN for the engines that lacked it (most notably, PostgreSQL before 7.2)
This approach has some serious drawbacks:
It may fail if TABLE2.ID is not UNIQUE
Some engines will not be able to use anything other than NESTED LOOPS for this query
If you need to select more than one column, you will need to write the subquery several times
If your engine supports LEFT JOIN, use the LEFT JOIN.
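For the query in the question, the LEFT JOIN form would look like this (assuming TABLE2.ID is unique, so no TABLE1 rows are duplicated):

SELECT TABLE1.COLUMN1,
       TABLE2.COLUMN1 AS COLUMN2
FROM TABLE1
LEFT JOIN TABLE2
    ON TABLE2.ID = TABLE1.TABLE2ID;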
In MySQL, however, there are some cases when an aggregate function in a select-level subquery can be more efficient than that in a LEFT JOIN with a GROUP BY.
See this article in my blog for the examples:
Aggregates: subqueries vs. GROUP BY
This is not a bad programming practice at all IMO, though it is a little bit ugly. It can actually be a performance boost in situations where the sub-select is from a very large table while you are expecting a very small result set (you have to consider indexes and platform; SQL Server 2000 had a different optimizer from 2005). Here is how I format it to be easier to read.
select
    column1,
    [column2] = (subselect...)
from
    table1
Edit:
This of course assumes that your subselect will only return one value; if not, it could be returning bad results. See gbn's response.
Using a JOIN also makes it a lot easier to use other types of joins (left outer, cross, etc.), because the syntax for those in subquery form is less than ideal for readability.
At the end of the day, the goal when writing code, beyond functional requirements, is to make the intent of your code clear to a reader. If you use a JOIN, the intent is obvious. If you use a subquery in the manner you describe, it begs the question of why you did it that way. What were you trying to achieve that a JOIN would not have accomplished? In short, you waste the reader's time in trying to determine if the author was solving some problem in an ingenious fashion or if they were writing the code after a hard night of drinking.

Ordering of WHERE clauses for SQL Server

Can it make any difference to query optimisation to have WHERE clauses in a different order for SQL Server?
For example, would the query plan for this:
select * from table where col1 = @var1 and col2 = @var2
be different from this?:
select * from table where col2 = @var2 and col1 = @var1
Of course this is a contrived example, and I have tried more complex ones out. The query plan was the same for both, but I've always wondered whether it was worth ordering WHERE clauses such that the most specific clauses come first, in case the optimiser somehow "prunes" results and could end up being faster.
This is really just a thought experiment, I'm not looking to solve a specific performance problem.
What about other RDBMS's, too?
Every modern RDBMS has a query optimizer which is responsible, among other things, for reordering the conditions. Some optimizers use pretty sophisticated statistics to do this, and they often beat human intuition about what will be a good ordering and what will not. So my guess would be: if you can definitely say "this ordering is better than the other one", so can the optimizer, and it will figure it out by itself.
Conclusion: Don't worry about such things. It is rarely worth the time.
Readability for humans should be your only goal when you determine the order of the conditions in the where clause. For example, if you have a join on two tables, A and B, write all conditions for A, then all conditions for B.
No, the query optimizer will figure out which index or statistics to use anyway. I'm not entirely sure, but I even think that Boolean expressions in SQL are not evaluated from left to right and can be evaluated in any order by the query optimizer.
I don't think it'll make much of a difference.
What does make a difference, in all SQL dialects, is how you use SQL functions in the WHERE clause.
For example:
When you do something like this:
select title, date FROM sometable WHERE to_char(date, 'DD-MM-YYYY') > '01-01-1960'
it would go slower than this:
select title, date FROM sometable WHERE date > to_date(%USERVALUE%, 'DD-MM-YYYY')
This is because of the number of times the function needs to be evaluated: once per row in the first case, once per query in the second.
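A sketch of the same idea in SQL Server syntax (table and column names are made up): wrapping the column in a function forces it to be evaluated for every row and usually prevents an index seek, whereas comparing the bare column against a constant keeps the predicate sargable.

-- Non-sargable: YEAR() runs for every row; an index on OrderDate cannot be used for a seek.
SELECT title, OrderDate
FROM SomeTable
WHERE YEAR(OrderDate) > 1960;

-- Sargable: the constant is compared directly and an index seek is possible.
SELECT title, OrderDate
FROM SomeTable
WHERE OrderDate >= '19610101';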
Also, the ordering might make a difference in case you are using nested queries. The smaller the result set of the inner query, the less scanning the outer query will need to do over it.
Of course, having said that, I second what rsp said in the comment above: indexes are the main factor in determining how long a query will take. If there is an index on the column, SQL Server will do a SEEK instead of SCANning the values, and the ordering becomes irrelevant.
SQL is "declarative" so it makes no difference. You tell the DBMS what you want, it figures out the best way (subject to cost, time etc).
In .net, it would make a difference because it's procedural and executed in order.

Slow SQL Query due to inner and left join?

Can anyone explain this behavior or how to get around it?
If you execute this query:
select *
from TblA
left join freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
inner join DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
It will be very very very slow.
If you change that query to use two inner joins instead of a left join, it will be very fast. If you change it to use two left joins instead of an inner join, it will be very fast.
You can observe this same behavior if you use a sql table variable instead of the freetexttable as well.
The performance problem arises any time you have a table variable (or freetexttable) and a table in a different database catalog where one is in an inner join and the other is in a left join.
Does anyone know why this is slow, or how to speed it up?
A general rule of thumb is that OUTER JOINs cause the number of rows in a result set to increase, while INNER JOINs cause the number of rows in a result set to decrease. Of course, there are plenty of scenarios where the opposite is true as well, but it's more likely to work this way than not. What you want to do for performance is keep the size of the result set (working set) as small as possible for as long as possible.
Since both joins match on the first table, changing the order won't affect the accuracy of the results. Therefore, you probably want to do the INNER JOIN before the LEFT JOIN:
SELECT *
FROM TblA
INNER JOIN DifferentDbCatalog.dbo.TblC on TblA.ID = TblC.TblAID
LEFT JOIN freetexttable ( TblB, *, 'query' ) on TblA.ID = [Key]
As a practical matter, the query optimizer should be smart enough to compile to use the faster option, regardless of which order you specified for the joins. However, it's good practice to pretend that you have a dumb query optimizer, and that query operations happen in order. This helps future maintainers spot potential errors or assumptions about the nature of the tables.
Because the optimizer should re-write things, this probably isn't good enough to fully explain the behavior you're seeing, so you'll still want to examine the execution plan used for each query, and probably add an index as suggested earlier. This is still a good principle to learn, though.
What you should usually do is turn on the "Show Actual Execution Plan" option and then take a close look at what is causing the slowdown. (hover your mouse over each join to see the details) You'll want to make sure that you are getting an index seek and not a table scan.
I would assume what is happening is that SQL is being forced to pull everything from one table into memory in order to do one of the joins. Sometimes reversing the order that you join the tables will also help things.
Putting freetexttable(TblB, *, 'query') into a temp table may help if it's getting called repeatedly in the execution plan.
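A rough sketch of that approach, using the table names from the question (FREETEXTTABLE exposes [KEY] and [RANK] columns):

-- Materialize the full-text results once, then join against the temp table.
SELECT [KEY] AS TblBKey, [RANK] AS FtRank
INTO #FtResults
FROM FREETEXTTABLE(TblB, *, 'query');

SELECT *
FROM TblA
LEFT JOIN #FtResults ft ON TblA.ID = ft.TblBKey
INNER JOIN DifferentDbCatalog.dbo.TblC ON TblA.ID = TblC.TblAID;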
Index the field you use to perform the join.
A good rule of thumb is to assign an index to any commonly referenced foreign or candidate keys.
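In this case that would mean something along these lines (the index name is made up):

CREATE INDEX IX_TblC_TblAID
    ON DifferentDbCatalog.dbo.TblC (TblAID);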
