I am creating a report, and for it I have written two different types of query. But I am seeing a huge performance difference between the two methods. What could be the reason? My main table (say, table A) contains a date column, and I am filtering the data based on that date. I have to join around 10 tables to this table.
First method:
select A.id, A1.name, ...
from A
join A1 on ...
join A2 on ... -- and so on up to A10
where A.Cdate >= #date
  and A.Cdate <= #date
Second method:
with CTE as (select A.id from A where A.Cdate >= #date and A.Cdate <= #date)
select CTE.id, A1.name, ... from CTE join A1 on ... join A2 on ... -- up to A10
Here the second method is faster. What is the reason? In the first method, only the filtered data from A should be joined with the other tables' data, right?
Execution plans will tell us for sure, but likely if the CTE is able to filter out a lot of rows before the join, a more optimal join approach may have been chosen (for example merge instead of hash). SQL isn't supposed to work that way - in theory those two should execute the same way. But in practice we find that SQL Server's optimizer isn't perfect and can be swayed in different ways based on a variety of factors, including statistics, selectivity, parallelism, a pre-existing version of that plan in the cache, etc.
One suggestion: you should be able to answer this question yourself, since you have both plans. Did you compare the two plans? Are they similar? Also, when you say performance is bad, what do you mean: elapsed time, CPU time, I/O, or what exactly did you compare?
So before you post a question like this, check those counters; in most cases they will point you toward the answer.
A CTE is for organizing code; it won't automatically improve query performance. CTEs are expanded by the optimizer, so in your case both versions should become the same query after expansion and thus produce similar plans.
A CTE is mainly for handling more complex code (whether recursive or containing subqueries), but you need to check the execution plans to understand the difference between the two queries.
You can check for the use of CTE at : http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
Just a note: in many scenarios temp tables give better performance than CTEs, so you should give temp tables a try as well (see the sketch below).
Reference : http://social.msdn.microsoft.com/Forums/en/transactsql/thread/d040d19d-016e-4a21-bf44-a0359fb3c7fb
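For illustration only, a temp-table version of the first query might look something like the sketch below. Here @StartDate/@EndDate stand in for the #date placeholders, and the join conditions to A1 ... A10 are left schematic, just as in the question:
SELECT A.id
INTO #FilteredA
FROM A
WHERE A.Cdate >= @StartDate
  AND A.Cdate <= @EndDate;
-- then join the small filtered set to the other tables,
-- with the same join conditions as in the original query
SELECT F.id, A1.name -- , ...
FROM #FilteredA F
JOIN A1 ON ... -- A2 ... A10 follow in the same way
DROP TABLE #FilteredA;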
Related
Scenario
Quick background on this one: I am attempting to optimize the use of an inline table-valued function uf_GetVisibleCustomers(#cUserId). The iTVF wraps a view CustomerView and filters out all rows containing data for customers whom the provided requesting user is not permitted to see. This way, should selection criteria ever change in the future for certain user types, we won't have to implement that new condition a hundred times (hyperbole) all over the SQL codebase.
Performance is not great, however, so I want to fix that before encouraging use of the iTVF. Changed database object names here just so it's easier to demonstrate (hopefully).
Queries
In attempting to optimize our iTVF uf_GetVisibleCustomers, I've noticed that the following SQL …
CREATE TABLE #tC ( idCustomer INT )
INSERT #tC
SELECT idCustomer
FROM [dbo].[uf_GetVisibleCustomers]('requester')
SELECT T.fAmount
FROM [Transactions] T
JOIN #tC C ON C.idCustomer = T.idCustomer
… is orders of magnitude faster than my original (IMO more readable, likely to be used) SQL here…
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
I don't get why this is. The former (top block of SQL) returns ~700k rows in 17 seconds on a fairly modest development server. The latter (second block of SQL) returns the same number of rows in about ten minutes when there is no other user activity on the server. Maybe worth noting that there is a WHERE clause, however I have omitted it here for simplicity; it is the same for both queries.
Execution Plan
Below is the execution plan for the first query. It enjoys automatic parallelism, as mentioned, while the latter plan isn't worth displaying here because it's just massive (it expands the entire iTVF and the underlying view, plus subqueries). Anyway, the latter also does not execute in parallel (AFAIK) to any extent.
My Questions
Is it possible to achieve performance comparable to the first block without a temp table?
That is, with the relative simplicity and human-readability of the slower SQL.
Why is a join to a temp table faster than a join to iTVF?
Why is it faster to use a temp table than an in-memory table populated the same way?
Beyond those explicit questions, if someone can point me in the right direction toward understanding this better in general then I would be very grateful.
Without seeing the DDL for your inline function - it's hard to say what the issue is. It would also help to see the actual execution plans for both queries (perhaps you could try: https://www.brentozar.com/pastetheplan/). That said, I can offer some food for thought.
As you mentioned, the iTVF accesses the underlying tables, views and associated indexes. If your statistics are not up to date you can get a bad plan; that won't happen with your temp table. On that note, how long does it take to populate that temp table?
Another thing to look at (again, this is why the DDL is helpful): are the data types the same for Transactions.idCustomer and #tC.idCustomer? I see a hash match in the plan you posted, which seems like a poor choice for a join between two IDs (a nested loops or merge join would be better). This could be slowing both queries down, but would have a more dramatic impact on the query that leverages your iTVF.
Again, this is speculation based on my experience. A couple of quick things to try (not as a permanent fix, but for troubleshooting; examples follow the list):
1. Check whether recompiling the query that uses the iTVF speeds things up (if it does, that is a sign of bad statistics or a bad execution plan being cached and re-used).
2. Try forcing a parallel plan for the iTVF query. You can do this by adding OPTION (QUERYTRACEON 8649) to the end of your query or by using make_parallel() by Adam Machanic.
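For concreteness, both experiments applied to the slow query from the question would look roughly like this (OPTION (RECOMPILE) is standard T-SQL; trace flag 8649 is undocumented, so treat it as a troubleshooting tool only):
-- 1. Recompile on each run to rule out a stale cached plan / bad statistics
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (RECOMPILE);
-- 2. Encourage a parallel plan (undocumented trace flag; troubleshooting only)
SELECT T.fAmount
FROM [Transactions] T
JOIN [dbo].[uf_GetVisibleCustomers]('requester') C ON C.idCustomer = T.idCustomer
OPTION (QUERYTRACEON 8649);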
We are trying to optimize some of our queries.
One query is doing the following:
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
INTO [#Gadget]
FROM task t
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) as Client
FROM [#Gadget]
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC
DROP TABLE [#Gadget]
(I have removed the complex subquery. I don't think it's relevant other than to explain why this query has been done as a two stage process.)
I thought it would be far more efficient to merge this down into a single query using subqueries as:
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID)
FROM
(
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
FROM task t
) as sub
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC
This would give the optimizer better information to work out what was going on and avoid any temporary tables. I assumed it should be faster.
But it turns out it is a lot slower. 8 seconds vs. under 5 seconds.
I can't work out why this would be the case, as all my knowledge of databases implies that subqueries should always be faster than using temporary tables.
What am I missing?
Edit --
From what I have been able to see from the query plans, both are largely identical, except for the temporary table which has an extra "Table Insert" operation with a cost of 18%.
Obviously, as the temp-table version runs two queries, the cost of its Sort Top N is a lot higher in the second query than the cost of the Sort in the subquery method, so it is difficult to make a direct comparison of the costs.
Everything I can see from the plans would indicate that the subquery method would be faster.
"should be" is a hazardous thing to say of database performance. I have often found that temp tables speed things up, sometimes dramatically. The simple explanation is that it makes it easier for the optimiser to avoid repeating work.
Of course, I've also seen temp tables make things slower, sometimes much slower.
There is no substitute for profiling and studying query plans (take their cost estimates with a grain of salt, though).
Obviously, SQL Server is choosing the wrong query plan. Yes, that can happen, I've had exactly the same scenario as you a few times.
The problem is that optimizing a query (you mention a "complex subquery") is a non-trivial task: If you have n tables, there are roughly n! possible join orders -- and that's just the beginning. So, it's quite plausible that doing (a) first your inner query and (b) then your outer query is a good way to go, but SQL Server cannot deduce this information in reasonable time.
What you can do is to help SQL Server. As Dan Tow writes in his great book "SQL Tuning", the key is usually the join order, going from the most selective to the least selective table. Using common sense (or the method described in his book, which is a lot better), you could determine which join order would be most appropriate and then use the FORCE ORDER query hint.
Anyway, every query is unique, there is no "magic button" to make SQL Server faster. If you really want to find out what is going on, you need to look at (or show us) the query plans of your queries. Other interesting data is shown by SET STATISTICS IO, which will tell you how much (costly) HDD access your query produces.
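To make that concrete, here is how you could capture the I/O statistics and append the hint to the single-query version from the question (the <complex subquery> placeholder is kept from the original; FORCE ORDER only matters once that subquery's joins are in play):
SET STATISTICS IO ON;
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) AS Client
FROM
(
    SELECT t.TaskID, t.Name AS Task, '' AS Tracker, t.ClientID, (<complex subquery>) AS Date
    FROM task t
) AS sub
ORDER BY CASE WHEN Date IS NULL THEN 1 ELSE 0 END, Date ASC
OPTION (FORCE ORDER); -- join in the order written, most selective source first
SET STATISTICS IO OFF;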
I have re-iterated this question here: How can I force a subquery to perform as well as a #temp table?
The nub of it is: yes, I get that sometimes the optimiser is right to meddle with your subqueries as if they weren't fully self-contained, but sometimes it takes a bad wrong turn when it tries to be clever, in a way that we're all familiar with. I'm saying there must be a way of switching off that "cleverness" where necessary, instead of wrecking a view-led approach with temp tables.
I find myself reluctant to switch to using JOIN when I can easily solve the same problem by using an inner query:
e.g.
SELECT COLUMN1, ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2 FROM TABLE1;
My question is, is this a bad programming practice? I find it easier to read and maintain as opposed to a join.
UPDATE
I want to add that there's some great feedback here which, in essence, is pushing me back toward using JOIN. I find myself less and less involved with using T-SQL directly these days as a result of ORM solutions (LINQ to SQL, NHibernate, etc.), but when I do, it's things like correlated subqueries that I find easier to type out linearly.
Personally, I find this incredibly difficult to read. It isn't the structure a SQL developer expects. By using JOIN, you keep all of your table sources in a single spot instead of spreading them throughout your query.
What happens if you need to have three or four joins? Putting all of those into the SELECT clause is going to get hairy.
A join is usually faster than a correlated subquery as it acts on the set of rows rather than one row at a time. I would never let this code go to my production server.
And I find a join much much easier to read and maintain.
If you needed more than one column from the second table, then you would require two subqueries. This typically would not perform as well as a join.
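For example, pulling a second column out of TABLE2 means repeating the correlated lookup (TABLE2.COLUMN3 here is hypothetical, just to show the duplication):
SELECT COLUMN1,
       ( SELECT COLUMN1 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN2,
       ( SELECT COLUMN3 FROM TABLE2 WHERE TABLE2.ID = TABLE1.TABLE2ID ) AS COLUMN3 -- second, nearly identical subquery
FROM TABLE1;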
This is not equivalent to JOIN.
If you have multiple rows in TABLE2 for each row in TABLE1, you won't get them.
For each row in TABLE1 you get one row output so you can't get multiple from TABLE2.
This is why I'd use "JOIN": to make sure I get the data I wanted...
After your update: I rarely use correlation except with EXISTS...
The query you use was often used as a replacement for a LEFT JOIN for the engines that lacked it (most notably, PostgreSQL before 7.2)
This approach has some serious drawbacks:
It may fail if TABLE2.ID is not UNIQUE
Some engines will not be able to use anything else than NESTED LOOPS for this query
If you need to select more than one column, you will need to write the subquery several times
If your engine supports LEFT JOIN, use the LEFT JOIN.
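Using the names from the question, the LEFT JOIN form would be the following. Note that, unlike the scalar subquery (which errors if more than one row matches), it returns one row per matching TABLE2 row when TABLE2.ID is not unique:
SELECT TABLE1.COLUMN1,
       TABLE2.COLUMN1 AS COLUMN2
FROM TABLE1
LEFT JOIN TABLE2 ON TABLE2.ID = TABLE1.TABLE2ID;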
In MySQL, however, there are some cases when an aggregate function in a select-level subquery can be more efficient than that in a LEFT JOIN with a GROUP BY.
See this article in my blog for the examples:
Aggregates: subqueries vs. GROUP BY
This is not a bad programming practice at all, IMO, though it is a little bit ugly. It can actually be a performance boost in situations where the sub-select is from a very large table while you are expecting a very small result set (you have to consider indexes and platform; SQL Server 2000 has a different optimizer from 2005). Here is how I format it to be easier to read:
select
    column1,
    [column2] = (subselect...)
from
    table1
Edit:
This of course assumes that your subselect will return only one value; if not, it could return bad results. See gbn's response.
Using JOIN also makes it a lot easier to use other types of joins (left outer, cross, etc.), because the syntax for those in subquery form is less than ideal for readability.
At the end of the day, the goal when writing code, beyond functional requirements, is to make the intent of your code clear to a reader. If you use a JOIN, the intent is obvious. If you use a subquery in the manner you describe, it begs the question of why you did it that way. What were you trying to achieve that a JOIN would not have accomplished? In short, you waste the reader's time in trying to determine if the author was solving some problem in an ingenious fashion or if they were writing the code after a hard night of drinking.
I'm curious which of the following below would be more efficient?
I've always been a bit cautious about using IN because I believe SQL Server turns the result set into a big IF statement. For a large result set, this could result in poor performance. For small result sets, I'm not sure either is preferable. For large result sets, wouldn't EXISTS be more efficient?
WHERE EXISTS (SELECT * FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
vs.
WHERE bx.BoxID IN (SELECT BoxID FROM Base WHERE [Rank] = 2)
EXISTS will be faster because once the engine has found a hit, it will quit looking as the condition has proved true.
With IN, it will collect all the results from the sub-query before further processing.
The accepted answer is shortsighted and the question a bit loose in that:
1) Neither explicitly mentions whether a covering index is present on the left side, the right side, or both sides.
2) Neither takes into account the size of the input left-side set and the input right-side set.
(The question just mentions an overall large result set.)
I believe the optimizer is smart enough to convert between IN and EXISTS when there is a significant cost difference due to (1) and (2); otherwise it may just be used as a hint (e.g. EXISTS to encourage use of a seekable index on the right side).
Both forms can be converted to join forms internally, have the join order reversed, and run as loop, hash or merge joins, based on the estimated row counts (left and right) and on index existence on the left side, right side, or both sides.
I've done some testing on SQL Server 2005 and 2008, and on both, EXISTS and IN come back with the exact same actual execution plan, as others have stated. The optimizer is optimal. :)
Something to be aware of though, EXISTS, IN, and JOIN can sometimes return different results if you don't phrase your query just right: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I'd go with EXISTS over IN, see below link:
SQL Server: JOIN vs IN vs EXISTS - the logical difference
There is a common misconception that IN behaves equally to EXISTS or JOIN in terms of returned results. This is simply not true.
IN: Returns true if a specified value matches any value in a subquery or a list.
Exists: Returns true if a subquery contains any rows.
Join: Joins 2 resultsets on the joining column.
Blog credit: https://stackoverflow.com/users/31345/mladen-prajdic
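To make the distinction concrete with the bx/Base names from the question (treating bx as a table for brevity): if Base holds the same BoxID twice with [Rank] = 2, the JOIN repeats the bx row, while EXISTS (and IN) return it once.
SELECT bx.BoxID
FROM bx
JOIN Base ON Base.BoxID = bx.BoxID AND Base.[Rank] = 2; -- duplicate Base rows repeat bx rows
SELECT bx.BoxID
FROM bx
WHERE EXISTS (SELECT * FROM Base WHERE Base.BoxID = bx.BoxID AND Base.[Rank] = 2); -- at most one row per bx row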
There are many misleading answers here, including the highly upvoted one (although I don't believe their authors meant harm). The short answer is: these are the same.
There are many keywords in the (T-)SQL language, but in the end, the only thing that really happens on the hardware is the set of operations seen in the execution plan.
The relational (math theory) operation we perform when we invoke [NOT] IN and [NOT] EXISTS is the semi join (anti-join when using NOT). It is not a coincidence that the corresponding SQL Server operations have the same name. There is no operation that mentions IN or EXISTS anywhere, only (anti-)semi joins. Thus, there is no way that a logically equivalent IN vs. EXISTS choice could affect performance, because there is one and only one way, the (anti-)semi join execution operation, to get their results.
An example:
Query 1 ( plan )
select * from dt where dt.customer in (select c.code from customer c where c.active=0)
Query 2 ( plan )
select * from dt where exists (select 1 from customer c where c.code=dt.customer and c.active=0)
The execution plans are typically going to be identical in these cases, but until you see how the optimizer factors in all the other aspects of indexes etc., you really will never know.
So, IN is not the same as EXISTS, nor will it produce the same execution plan.
Usually EXISTS is used with a correlated subquery, which means the EXISTS inner query is joined with your outer query. That adds more steps to produce a result, as you need to resolve the outer query joins and the inner query joins and then match their WHERE clauses to join both.
Usually IN is used without correlating the inner query with the outer query, and that can be solved in only one step (in the best case scenario).
Consider this:
If you use IN and the inner query result is millions of rows of distinct values, it will probably perform slower than EXISTS, provided the EXISTS query is performant (i.e. has the right indexes to join with the outer query).
If you use EXISTS and the join with your outer query is complex (takes longer to perform, with no suitable indexes), it will slow the query down in proportion to the number of rows in the outer table; sometimes the estimated time to complete can be in days. If the number of rows is acceptable for your hardware, or the cardinality of the data is right (for example, fewer DISTINCT values in a large data set), IN can perform faster than EXISTS.
All of the above matters once you have a fair number of rows in each table (by fair I mean something that exceeds your CPU and/or RAM caching thresholds).
So the ANSWER is: it DEPENDS. You can write a complex query inside IN or EXISTS, but as a rule of thumb, you should try to use IN with a limited set of distinct values and EXISTS when you have a lot of rows with a lot of distinct values.
The trick is to limit the number of rows to be scanned.
To optimize the EXISTS, be very literal; something just has to be there, but you don't actually need any data returned from the correlated sub-query. You're just evaluating a Boolean condition.
So:
WHERE EXISTS (SELECT TOP 1 1 FROM Base WHERE bx.BoxID = Base.BoxID AND [Rank] = 2)
Because the correlated sub-query is RBAR, the first result hit makes the condition true, and it is processed no further.
I know that this is a very old question but I think my answer would add some tips.
I just came across a blog post on MSSQLTips, sql exists vs in vs join, and it turns out that they are generally the same performance-wise.
But the downsides of one versus the others are as follows:
The IN statement can only compare the two tables on one column.
The JOIN will return repeated rows for duplicate values, while IN and EXISTS will ignore duplicates.
But when you look at the execution time there is no big difference.
The interesting thing is that when you create an index on the table, the execution of the JOIN is better.
And I think JOIN has another upside: it's easier to write and understand, especially for newcomers.
Off the top of my head and not guaranteed to be correct: I believe the second will be faster in this case.
In the first, the correlation will likely cause the subquery to be run for each row.
In the second example, the subquery should only run once, since it is not correlated.
In the second example, the IN will short-circuit as soon as it finds a match.
I have a relation between two tables with 600K rows, and my first question is: is that a lot of data? It doesn't seem like a lot (in terms of rows, not bytes).
I can write a query like this
SELECT EntityID, COUNT(*)
FROM QueryMembership
GROUP BY EntityID
And it completes in no time at all. But when I do this:
SELECT EntityID, COUNT(*)
FROM QueryMembership
WHERE PersonID IN (SELECT PersonID FROM GetAccess(1))
GROUP BY EntityID
The thing takes 3-4 seconds to complete, despite just returning about 183 rows. SELECT * FROM QueryMembership takes about 12-13 seconds.
What I don't understand is how a filter like this can take so long as soon as I introduce the table-valued function. The function itself doesn't take any time at all to return its result, and whether I write it as a CTE or some bizarre subquery, the result is the same.
However, if I defer the filter by inserting the result of the first SELECT into a temporary table #temp and then applying the GetAccess UDF, the entire thing goes about three times as fast.
I would really like some in-depth technical help on this matter: where should I start looking, and how can I analyze the execution plan to figure out what's going on?
There's an excellent series of posts on execution plans and how to read and interpret them - and a totally free e-book on the topic as well! - on the excellent Simple-Talk site.
Check them out - well worth the time!
Execution Plan Basics
SQL Server Execution Plans
Understanding More Complex Query Plans
Graphical Execution Plans for Simple SQL Queries
SQL Server Execution Plans - free e-book download
600k rows is not a particularly large amount. However, you are getting to the point where server configuration (disks, non-SQL load, etc) matters, so if your server wasn't carefully put together you should look at that now rather than later.
Analyzing execution plans is one of those things that you tend to pick up over time. The book "Inside SQL Server" is (was?) pretty nice for learning how things work internally, which helps guide you a bit as you're optimizing.
I would personally try rewriting the above query as a join, IN often doesn't perform as well as you might hope. Something like:
SELECT
EntityID,
COUNT(*)
FROM
QueryMembership q
join GetAccess(1) a on a.PersonID = q.PersonID
GROUP BY
EntityID
SELECT EntityID, COUNT(*)
FROM QueryMembership
WHERE PersonID IN (SELECT PersonID FROM GetAccess(1))
GROUP BY EntityID
The embedded subquery is expensive. As you said, using a temporary table is a perfect alternative solution.
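A sketch of that alternative, assuming it is the UDF result that gets materialized (the question's wording leaves this slightly ambiguous, and #Visible is just an illustrative name):
SELECT PersonID
INTO #Visible
FROM GetAccess(1); -- evaluate the function once
SELECT EntityID, COUNT(*)
FROM QueryMembership
WHERE PersonID IN (SELECT PersonID FROM #Visible)
GROUP BY EntityID;
DROP TABLE #Visible;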
I suspect that the reasons for your slowdown may be similar to those in this question:
how to structure an index for group by in Sql Server
An execution plan will answer the question of why the second query is slower; however, I suspect it will be because SQL Server can evaluate aggregate functions (such as COUNT and MAX) using relatively inexpensive operations on some index.
If you combine a filter and a group, however, SQL Server can no longer use this trick and is forced to evaluate COUNT or MAX against the filtered result set, leading to expensive lookups.
600k rows is a fairly reasonable/small table size; however, it's big enough that things like table scans or RID lookups against large portions of the table start becoming expensive.
I'd be interested to see the execution plan to understand what's going on.
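For what it's worth, the kind of index the linked question is about would look something like this (hypothetical name; whether it actually helps depends on what the plan shows):
-- covers the filter column and the grouping column,
-- so the filtered aggregate can be answered from the index alone
CREATE NONCLUSTERED INDEX IX_QueryMembership_Person_Entity
    ON QueryMembership (PersonID, EntityID);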