I have a query that needs to incorporate conditional logic. There are 4 cases that need to be considered, and the resulting sets are disjoint.
I can implement the query using either a single SELECT and CASE/WHEN statements or using multiple SELECT statements and UNION ALL.
In general, is one of these implementations likely to be faster than the other? If so, why?
A UNION ALL runs that many separate SELECTs, so a CASE expression will generally be better, IMHO, if the FROM clause is not complex and all other things are equal. But they are NOT similar SQL results:
A 'CASE WHEN ...' adds another column to each row, while a UNION ALL requires every SELECT in the union to return the same set of columns, and it adds more rows. For instance, if you queried three separate tables and then UNION'd them together you are doing three SELECTs, whereas three CASE WHENs would be efficient if you were querying one table. But you could be querying five. Without knowing the source, the answer really is: 'it depends'.
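As a sketch of the two shapes being compared (the table and column names here are hypothetical):

-- Single scan with a CASE expression: one pass over the table
SELECT OrderID,
       CASE WHEN Amount < 100  THEN 'Small'
            WHEN Amount < 1000 THEN 'Medium'
            WHEN Amount < 5000 THEN 'Large'
            ELSE 'Huge'
       END AS SizeBand
FROM Orders;

-- The same four disjoint cases as separate scans glued with UNION ALL
SELECT OrderID, 'Small'  AS SizeBand FROM Orders WHERE Amount < 100
UNION ALL
SELECT OrderID, 'Medium' AS SizeBand FROM Orders WHERE Amount >= 100  AND Amount < 1000
UNION ALL
SELECT OrderID, 'Large'  AS SizeBand FROM Orders WHERE Amount >= 1000 AND Amount < 5000
UNION ALL
SELECT OrderID, 'Huge'   AS SizeBand FROM Orders WHERE Amount >= 5000;

Because the cases are disjoint, both return the same rows; the difference is whether the engine makes one pass or several.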
I just set the ole 'SET STATISTICS TIME ON' when doing quick timing of the SQL engine to see. People can argue semantics, but the engine does not lie when it tells you what is going on. SQL Server 2005 and higher, I believe, also has 'Include Actual Execution Plan' on the toolbar; it is a nice-looking little icon of three squares arranged in an L shape, with the corner of the L in the upper left. If you have something very complex and are getting really into fine tuning, that is the tool of choice to examine what the engine is doing under the hood with your query.
This really depends entirely on what the logic and the data you expect to be selecting from look like. If you're running this SELECT against huge datasets and the logic is fairly simple, like WHEN Val BETWEEN A AND B THEN C, you'll probably get a little bit of an uplift by putting the logic in your WHERE clause and doing a UNION ALL, but not a ton of difference. On a comparatively small data set, it might not make any difference at all. It also might depend on whether you see this code being set in stone or subject to periodic change. UNION ALL will certainly be quite a few more lines of code, because you're basically writing the same query over and over with different WHERE clauses, but it also may be easier to read and maintain.
I am currently in the process of a database migration from MS Access to SQL Server. To improve the performance of a specific query, I am translating it from Access to T-SQL and executing it server-side. The query in question is essentially made up of almost 15 subqueries branching off in different directions with varying levels of complexity. The top-level query is a culmination (final SELECT) of all of these queries.
Without actually going into the specifics of the fields and relationships in my queries, I want to ask a question on a generic example.
Take the following:
Top Level Query
├── Query 1  (View?)
│   ├── Query 1.1
│   │   ├── Query 1.1.1
│   │   │   └── ...
│   │   └── Query 1.1.2
│   │       └── ...
│   └── Query 1.2
└── Query 2  (View?)
    ├── Query 2.1
    │   ├── Query 2.1.1
    │   │   └── ...
    │   └── Query 2.1.2
    │       └── ...
    └── Query 2.2
I am attempting to convert the above MS Access query structure to T-SQL whilst maximising performance. So far I have converted all of Query 1 into a single query, starting from the bottom and working my way up. I achieved this by using CTEs to represent every single subquery and then finally selecting from this entire CTE tree to produce Query 1. Due to the original design of the query, there is a high level of dependency between the subqueries.
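A minimal sketch of that bottom-up CTE approach (all names here are hypothetical stand-ins for the real subqueries, and the bodies are elided):

WITH Query_1_1_1 AS (
    SELECT ...   -- deepest subquery
),
Query_1_1_2 AS (
    SELECT ...
),
Query_1_1 AS (
    -- built on the two CTEs above
    SELECT ...
    FROM Query_1_1_1 AS a
    JOIN Query_1_1_2 AS b ON ...
),
Query_1_2 AS (
    SELECT ...
)
-- Query 1: the top of this branch
SELECT ...
FROM Query_1_1 AS x
JOIN Query_1_2 AS y ON ...;

Each CTE can reference the ones declared before it, which is what lets the whole dependency tree live in one statement.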
Now my question is quite simple, actually. With regards to Query 2, should I continue to use this same method within the same query window, or should I make both Query 1 and Query 2 separate entities (views) and then do a select from each? Or should I just continue adding more CTEs and then get the final Top Level Query result from this one super query?
This is an extremely bastardised version of the actual query I am working with, which has a large number of calculated fields and more subquery levels.
What do you think is the best approach here?
There is no way to say for sure from a diagram like this, but I suspect that you want to use views, for a number of reasons.
1) If the sub-query/view is used in more than one place, there is a good chance that caching will allow results to be shared across those places. The effect is not as strong as with a CTE, but that can be mitigated with a materialized (indexed) view.
2) It is easy to turn a view into a materialized view. Then you get a huge bonus if it is used multiple times, or is used many times before it needs to be refreshed.
3) If you find a slow part, it will be isolated to one view -- then you can optimize and change that small section more easily.
I would recommend using views for EVERY sub-query if you can, unless you can demonstrate (via execution plan or testing) that the CTE runs faster.
A final note as someone who has migrated Access to SQL Server in the past: Access encourages more sub-queries than are needed with modern SQL and windowing functions. It is very likely that, with some analysis, these Access queries can be made much simpler. Try to find cases where you can roll them up into the parent query.
A query you submit and a "view" are essentially the same thing.
Your BASIC simple question stands!
Should you use a query on a query, or use a CTE?
Well, first, let's clear up some confusion you have here.
A CTE is great for eliminating the need to build a query, save the query (say as a view), AND THEN query against it.
However, your question mixes up TWO VERY different issues.
Are you doing a query against a query, or, in YOUR case, using a sub-query? While these two things SEEM similar, they really are not!
In the case of a sub-query, using a CTE will in fact save you the pain of having to build and save a separate query/view. In this case you are, I suppose, doing a query on a query, but it is REALLY a sub-query. From a performance point of view, I don't believe you will find any difference, so do what works best for you. I do in some ways like adopting CTEs, since THEN you have the "whole thing" in one spot, and updates to the whole mess occur in one place. This can especially be an advantage if you have several sites: to update things, you ONLY have to update this one big saved "thing". I find this a significant advantage.
The advantage of breaking things out into separate views (as opposed to using CTEs) often comes down to the simple question: how do you eat an elephant?
Answer: one bite at a time.
However, I in fact consider the concept and approach of a sub-query a DIFFERENT issue than building a query on a query. One of the really big reasons for using CTEs in SQL Server is that SQL Server has one REALLY big limitation compared to Access SQL. That limitation, of course, is being able to re-use derived columns.
eg this:
SELECT ID, Company, State, TaxRate, Purchased, Payments,
(Purchased - Payments) as Balance,
(Balance * TaxRate) as BalanceWithTax
FROM Customers
Of course, in T-SQL you can't re-use an expression like you can in Access SQL, so the above is a GREAT use case for CTEs. Balance in T-SQL cannot be re-used, so you are constantly having to repeat expressions in T-SQL (my big pet peeve with it). Using a CTE means we CAN get the above ability to re-use an expression.
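A sketch of the CTE workaround for the Customers query above (same hypothetical columns):

WITH CustomerBalances AS (
    SELECT ID, Company, State, TaxRate, Purchased, Payments,
           (Purchased - Payments) AS Balance
    FROM Customers
)
SELECT ID, Company, State, TaxRate, Purchased, Payments,
       Balance,
       (Balance * TaxRate) AS BalanceWithTax   -- Balance is now re-usable
FROM CustomerBalances;

The derived column is computed once in the CTE, and the outer SELECT can reference it by name instead of repeating the expression.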
So I tend to think of the CTE as solving two issues, and you should keep these concepts separate:
First, I want to eliminate the need for a query on a query, and that includes sub-queries. So, sure, using a CTE for the above is a good use.
The SECOND use case is the ability to re-use expression columns. This is VERY painful in T-SQL, and CTEs go a long way toward reducing this pain point (Access SQL is still far better), but CTEs are at least very helpful.
So, from a performance point of view, using CTEs to eliminate sub-queries should not affect performance, and as noted you save having to create 2-5 separate queries to make this work.
Then there is the query-on-query case (especially the above use case of being able to re-use columns in expressions). Here I believe some performance advantages exist, but again, likely not enough to justify one approach over the other. So once again, adopting CTEs should come down to whichever road is LESS work for you! (But for a really nasty sub-query that SUMs() and does some real work, and THEN you need to re-use those columns -- that is when CTEs really shine.)
So, as a general coding approach, I use CTEs to avoid a query on a query (but NOT for a sub-query as you are doing), and I use CTEs to gain re-use of a complex expression column.
Using CTEs just to eliminate sub-queries is not really all that great a benefit; most of the time you can simply shove the sub-query into the main query, and a CTE will not help you much from a developer's point of view.
For column re-use, however, you have NO CHOICE in the matter: you MUST either do a query on a query (say, against a saved view) to gain re-use of expression columns in further expressions, or use a CTE, which eliminates that need.
While this is somewhat opinion, the ability to re-use columns is the real use case for CTEs. It is not so much that you eliminated the need for a separate view (query on query), but that you gained re-use of a column expression.
So you certainly can use CTEs to eliminate having to create views (yes, a good idea), but in your case you likely could have just used sub-queries anyway, and the CTEs are not really required. As far as I can tell, so far you don't really need to use a CTE unless the issue is being able to re-use some columns in other expressions, as we could/can in Access SQL.
If column re-use is the goal? Then yes, CTEs are a great solution. It is more a column-re-use issue than a choice about query on query. And if you did have the additional views in Access, then no question: adopting CTEs to keep a similar approach and design makes a lot of sense. The ability to re-use column expressions is what we lost by going to SQL Server, and CTEs do a lot to regain it.
I'm generating reports from a database that makes extensive use of XML to store time-series data. Annoyingly, most of these entries hold only a single value, complicating everything for no benefit. Looking here on SO, I found a couple of examples using OUTER APPLY to decode these fields into a single value.
One of these queries is timing out on the production machine, so I'm looking for ways to improve its performance. The query contains a dozen lines similar to:
SELECT...
PR.D.value('#A', 'NVARCHAR(16)') AS RP,
...
FROM Profiles LP...
OUTER APPLY LP.VariableRP.nodes('/X/E') RP(D)
...
When I look in the Execution Plan, each of these OUTER APPLYs has a huge operator cost, although I'm not sure that really means anything. In any event, these operators make up 99% of the query time.
Does anyone have any advice on how to improve these sorts of queries? I suspect there's a way to do this without OUTER APPLY, but my google-fu is failing.
Taking this literally
most of these entries hold only a single value
...it should be faster to avoid APPLY (which produces quite an overhead on creating a derived table) and read the one and only value directly:
SELECT LP.VariableRP.value('(/X/E/#A)[1]', 'NVARCHAR(16)') AS RP
FROM Profiles LP
If this does not provide what you need, please show us some examples of your XML, but I doubt this will get much faster.
There are XML indexes, but in most cases they don't help and can make things even worse.
You might use some kind of trigger or run-once logic to shift the needed values into a side column (into a related side table) and query from there.
I asked a question here: Using cursor in OLTP databases (SQL Server),
where people responded saying cursors should never be used.
I feel cursors are very powerful tools that are meant to be used (I don't think Microsoft supports cursors just for bad developers). Suppose you have a table where the value of a column in a row is dependent on the value of the same column in the previous row. If it is a one-time back-end process, don't you think using a cursor would be an acceptable choice?
Off the top of my head I can think of a couple of scenarios where I feel there should be no shame in using cursors. Please let me know if you guys feel otherwise.
A one time back end process to clean bad data which completes execution within a few minutes.
Batch processes that run once in a long period of time (something like once a year).
If in the above scenarios, there is no visible strain on the other processes, wouldn't it be unreasonable to spend extra hours writing code to avoid cursors? In other words in certain cases the developer's time is more important than the performance of a process that has almost no impact on anything else.
In my opinion these would be some scenarios where you should seriously try to avoid using a cursor.
A stored proc called from a website that can get called very often.
A SQL job that would run multiple times a day and consume a lot of resources.
I think it's very superficial to make a general statement like "cursors should never be used" without analyzing the task at hand and actually weighing it against the alternatives.
Please let me know of your thoughts.
There are several scenarios where cursors actually perform better than set-based equivalents. Running totals is the one that always comes to mind - look for Itzik's words on that (and ignore any that involve SQL Server 2012, which adds new windowing functions that give cursors a run for their money in this situation).
One of the big problems people have with cursors is that they perform slowly, they use temporary storage, etc. This is partially because the default syntax is a global cursor with all kinds of inefficient default options. The next time you're doing something with a cursor that doesn't need to do things like UPDATE...WHERE CURRENT OF (which I've been able to avoid my entire career), give it a fair shake by comparing these two syntax options:
DECLARE c CURSOR
FOR <SELECT QUERY>;
DECLARE c CURSOR
LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR <SELECT QUERY>;
In fact the first version represents a bug in the undocumented stored procedure sp_MSforeachdb which makes it skip databases if the status of any database changes during execution. I subsequently wrote my own version of the stored procedure (see here) which both fixed the bug (simply by using the latter version of the syntax above) and added several parameters to control which databases would be chosen.
A lot of people think that a methodology is not a cursor because it doesn't say DECLARE CURSOR. I've seen people argue that a while loop is faster than a cursor (which I hope I've dispelled here) or that using FOR XML PATH to perform group concatenation is not performing a hidden cursor operation. Looking at the plan in a lot of cases will show the truth.
In a lot of cases cursors are used where set-based is more appropriate. But there are plenty of valid use cases where a set-based equivalent is much more complicated to write, for the optimizer to generate a plan for, both, or not possible (e.g. maintenance tasks where you're looping through tables to update statistics, calling a stored procedure for each value in a result, etc.). The same is true for a lot of big multi-table queries where the plan gets too monstrous for the optimizer to handle. In these cases it can be better to dump some of the intermediate results into a temporary structure first. The same goes for some set-based equivalents to cursors (like running totals). I've also written about the other way, where people almost always think instinctively to use a while loop / cursor and there are clever set-based alternatives that are much better.
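For illustration, a minimal sketch of a full cursor loop using the faster options above (the use of sys.databases here is just a hypothetical example of per-row work):

DECLARE @name sysname;

DECLARE c CURSOR
    LOCAL STATIC READ_ONLY FORWARD_ONLY
FOR
    SELECT name FROM sys.databases;

OPEN c;
FETCH NEXT FROM c INTO @name;

WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @name;                        -- per-row work goes here
    FETCH NEXT FROM c INTO @name;
END

CLOSE c;
DEALLOCATE c;

LOCAL scopes the cursor to the batch, and STATIC READ_ONLY FORWARD_ONLY avoids the bookkeeping that the default global, dynamic cursor incurs.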
UPDATE 2013-07-25
Just wanted to add some additional blog posts I've written about cursors, which options you should be using if you do have to use them, and using set-based queries instead of loops to generate sets:
Best Approaches for Running Totals - Updated for SQL Server 2012
What impact can different cursor options have?
Generate a Set or Sequence Without Loops: [Part 1] [Part 2] [Part 3]
The issue with cursors in SQL Server is that the engine is set-based internally, unlike other DBMSs such as Oracle, which are cursor-based internally. This means that when you create a cursor in SQL Server, temporary storage needs to be created and the set-based result set needs to be copied over to the temporary cursor storage. You can see why this would be expensive right off the bat, not to mention any row-by-row processing that you might be doing on top of the cursor itself. The bottom line is that set-based processing is more efficient, and oftentimes your cursor-based operation can be done better using a CTE or temp table.
That being said, there are cases where a cursor is probably acceptable, as you said for one-off operations. The most common use I can think of is in a maintenance plan where you may be iterating through all the databases on a server executing various maintenance tasks. As long as you limit your usage and don't design whole applications around RBAR (row-by-agonizing-row) processing, you should be fine.
In general cursors are a bad thing. However in some cases it is more practical to use a cursor and in some it is even faster to use one. A good example is a cursor through a contact table sending emails based on some criteria. (Not to open up the question if sending an email from your DBMS is a good idea - let's just assume it is for the problem at hand.) There is no way to write that set-based. You could use some trickery to come up with a set-based solution to generate dynamic SQL, but a real set-based solution does not exist.
However, a calculation involving the previous row can be done using a self join. That is usually still faster than a cursor.
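A sketch of such a previous-row calculation via a self join (the table is hypothetical, and this assumes a gap-free ID column):

-- Difference between each row's value and the previous row's value
SELECT cur.ID,
       cur.Amount,
       cur.Amount - prev.Amount AS Delta
FROM Readings AS cur
LEFT JOIN Readings AS prev
       ON prev.ID = cur.ID - 1;

On SQL Server 2012 and later, the LAG() window function does the same thing without the self join and usually faster.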
In all cases you need to balance the effort involved in developing a faster solution against the benefit. If nobody cares whether your process runs in one minute or one hour, use what gets the job done quickest. If you are looping through a dataset that grows over time, like an [Orders] table, try to stay away from a cursor if possible. If you are not sure, do a performance test comparing a cursor-based solution with a set-based one at several significantly different data sizes.
I had always disliked cursors because of their slow performance. However, I found I didn't fully understand the different types of cursors and that in certain instances, cursors are a viable solution.
When you have a business problem that can only be solved by processing one row at a time, then a cursor is appropriate.
So, to improve performance with the cursor, change the type of cursor you are using. Something I didn't know was that if you don't specify which type of cursor you are declaring, you get the dynamic, optimistic type by default, which is the slowest for performance because it's doing lots of work under the hood. By declaring your cursor as a different type, say a static cursor, you get very good performance.
See these articles for a fuller explanation:
The Truth About Cursors: Part I
The Truth About Cursors: Part II
The Truth About Cursors: Part III
I think the biggest knock against cursors is performance; however, not laying out a task in a set-based approach would probably rank second. Third would be readability and layout, as cursor-based tasks usually don't have a lot of helpful comments.
SQL Server is optimized to run the set based approach. You write the query to return a result set of data, like a join on tables for example, but the SQL Server execution engine determines which join to use: Merge Join, Nested Loop Join, or Hash Join. SQL Server determines the best possible joining algorithm based upon the participating columns, data volume, indexing structure, and the set of values in the participating columns. So using a set based approach is generally the best approach in performance over the procedural cursor approach.
They are necessary for things like dynamic SQL pivoting, but you should try to avoid using them whenever possible.
We are trying to optimize some of our queries.
One query is doing the following:
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
INTO [#Gadget]
FROM task t

SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) as Client
FROM [#Gadget]
ORDER BY CASE WHEN Date IS NULL THEN 1 ELSE 0 END, Date ASC

DROP TABLE [#Gadget]
(I have removed the complex subquery. I don't think it's relevant other than to explain why this query has been done as a two stage process.)
I thought it would be far more efficient to merge this down into a single query using subqueries as:
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID)
FROM
(
    SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
    FROM task t
) as sub
ORDER BY CASE WHEN Date IS NULL THEN 1 ELSE 0 END, Date ASC
This would give the optimizer better information to work out what was going on and avoid any temporary tables. I assumed it should be faster.
But it turns out it is a lot slower. 8 seconds vs. under 5 seconds.
I can't work out why this would be the case, as all my knowledge of databases implies that subqueries would always be faster than using temporary tables.
What am I missing?
Edit --
From what I have been able to see from the query plans, both are largely identical, except for the temporary table which has an extra "Table Insert" operation with a cost of 18%.
Obviously, as the temp-table version runs two queries, the cost of its Sort Top N is a lot higher than the cost of the Sort in the subquery method, so it is difficult to make a direct comparison of the costs.
Everything I can see from the plans would indicate that the subquery method would be faster.
"should be" is a hazardous thing to say of database performance. I have often found that temp tables speed things up, sometimes dramatically. The simple explanation is that it makes it easier for the optimiser to avoid repeating work.
Of course, I've also seen temp tables make things slower, sometimes much slower.
There is no substitute for profiling and studying query plans (read their estimates with a grain of salt, though).
Obviously, SQL Server is choosing the wrong query plan. Yes, that can happen, I've had exactly the same scenario as you a few times.
The problem is that optimizing a query (you mention a "complex subquery") is a non-trivial task: If you have n tables, there are roughly n! possible join orders -- and that's just the beginning. So, it's quite plausible that doing (a) first your inner query and (b) then your outer query is a good way to go, but SQL Server cannot deduce this information in reasonable time.
What you can do is to help SQL Server. As Dan Tow writes in his great book "SQL Tuning", the key is usually the join order, going from the most selective to the least selective table. Using common sense (or the method described in his book, which is a lot better), you could determine which join order would be most appropriate and then use the FORCE ORDER query hint.
Anyway, every query is unique, there is no "magic button" to make SQL Server faster. If you really want to find out what is going on, you need to look at (or show us) the query plans of your queries. Other interesting data is shown by SET STATISTICS IO, which will tell you how much (costly) HDD access your query produces.
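For reference, a minimal way to capture that I/O and timing data for both variants (the query bodies are elided here):

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the temp-table version here
-- run the subquery version here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The Messages tab then reports logical and physical reads plus CPU and elapsed time per statement, which makes the two approaches directly comparable.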
I have re-iterated this question here: How can I force a subquery to perform as well as a #temp table?
The nub of it is: yes, I get that sometimes the optimiser is right to meddle with your subqueries as if they weren't fully self-contained, but sometimes it takes a wrong turn when it tries to be clever in a way we're all familiar with. I'm saying there must be a way of switching off that "cleverness" where necessary, instead of wrecking a view-led approach with temp tables.
I've heard that using an IN Clause can hurt performance because it doesn't use Indexes properly. See example below:
SELECT ID, Name, Address
FROM people
WHERE id IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList))
Is it better to use the form below to get these results?
SELECT ID,Name,Address
FROM People as p
INNER JOIN UDF_ParseListToTable(@IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so which ones are affected?
Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets. I follow this and have noticed improvements in my code execution time.
According to the article, it has to do with how the IN vs. EXISTS is internalized. Another article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
It's very simple to find out - open Management studio, put both versions of the query in, then run with the Show Execution plan turned on. Compare the two execution plans. Often, but not always, the query optimizer will make the same exact plan / literally do the same thing for different versions of a query that are logically equivalent.
In fact, that's its purpose - the goal is that the optimizer would take ANY version of a query, assuming the logic is the same, and make an optimal plan. Alas, the process isn't perfect.
Here's one scientific comparison:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
IN can hurt performance because SQL Server must generate a complete result set and then potentially create a huge IF statement, depending on the number of rows in the result set. BTW, calling a UDF can be a real performance hit as well. They are very nice to use but can really impact performance if you are not careful. You can Google "UDF and performance" to do some research on this.
More than IN vs. the table variable, I would think that proper use of an index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it, so which way you go may be a moot point in this particular example.
Secondly, IN will be evaluated only once since there is no subquery. In your case the @IDList variable is probably going to cause mismatches -- you would need @IDList1, @IDList2, @IDList3, and so on, because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS with a join - you will get better performance more often than not.
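As a sketch, the EXISTS form of the same lookup (using the same hypothetical UDF from the question):

SELECT p.ID, p.Name, p.Address
FROM People AS p
WHERE EXISTS (SELECT 1
              FROM UDF_ParseListToTable(@IDList) AS ids
              WHERE ids.ParsedValue = p.ID);

Unlike the INNER JOIN form, EXISTS cannot duplicate rows from People if the parsed list happens to contain the same ID twice.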
Your first example is not the same as your second example, because WHERE X IN (@variable) is the same as WHERE X = @variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.