How does the WITH clause work in SQL Server? Does it actually give a performance boost, or does it just make scripts more readable?
When is it right to use it? What should you know about the WITH clause before you start using it?
Here's an example of what I'm talking about:
http://www.dotnetspider.com/resources/33984-Use-With-Clause-Sql-Server.aspx
I'm not entirely sure about performance advantages, but I think it can definitely help where using a subquery would otherwise result in that subquery being executed multiple times.
Apart from that, it can definitely make code more readable, and it avoids the cut-and-paste duplication you get when the same subquery has to appear in several places.
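For instance, a sketch (table and column names are hypothetical) of a CTE taking the place of a repeated subquery:

```sql
-- The aggregate is defined once; without the CTE it would be repeated
-- as a subquery wherever the per-department average is needed.
WITH dept_avg (dept_id, avg_salary) AS
(
    SELECT dept_id, AVG(salary)
    FROM employees
    GROUP BY dept_id
)
SELECT e.name, e.salary, d.avg_salary
FROM employees AS e
JOIN dept_avg AS d ON d.dept_id = e.dept_id
WHERE e.salary > d.avg_salary;
```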
What should you know before you use it?
A big downside is that when a view contains a CTE, you cannot create a clustered index on that view. This can be a real pain because SQL Server has no materialised views other than indexed views, and it has certainly bitten me before.
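A minimal repro of that limitation (table and column names are made up): the view itself creates fine, but the index is rejected because indexed views cannot contain CTEs.

```sql
CREATE VIEW dbo.ActiveOrders
WITH SCHEMABINDING
AS
WITH recent AS
(
    SELECT order_id, total
    FROM dbo.Orders
    WHERE status = 'A'
)
SELECT order_id, total FROM recent;
GO
-- Fails: SQL Server does not allow a clustered index on a view
-- whose definition contains a common table expression.
CREATE UNIQUE CLUSTERED INDEX IX_ActiveOrders
    ON dbo.ActiveOrders (order_id);
```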
Unless you use recursive abilities, a CTE is not better performance-wise than a simple inline view.
It just saves you some typing.
The optimizer is free to decide whether to re-evaluate it or not when it is reused, and in most cases it decides to re-evaluate:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will return two different NEWID values.
Note that other engines may behave differently.
PostgreSQL, unlike SQL Server, materializes CTEs by default (before version 12; from 12 onwards you can choose per CTE with MATERIALIZED / NOT MATERIALIZED).
Oracle supports a special hint, /*+ MATERIALIZE */, that tells the optimizer whether it should materialize the CTE or not.
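For example, the PostgreSQL 12+ syntax for controlling this per CTE looks like the following (the table name is hypothetical):

```sql
-- MATERIALIZED forces the CTE to be computed once;
-- NOT MATERIALIZED lets the planner inline it into the outer query.
WITH q AS MATERIALIZED
(
    SELECT id FROM big_table WHERE flag
)
SELECT * FROM q;
```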
WITH introduces a common table expression: a named, temporary result set that exists only for the duration of the statement (it is not necessarily stored in a temporary table). Example:
WITH a (id)  -- a is the CTE name, id is its column
AS
(
    SELECT column_name FROM table_name
)
SELECT * FROM a
My
DELETE FROM FOO
WHERE [FOO_KEY] NOT IN
(
SELECT [FOO_KEY] FROM BAR
)
query is running shockingly slow. I know that BAR is a very big table, so I'm tempted to write
DELETE FROM FOO
WHERE [FOO_KEY] NOT IN
(
SELECT DISTINCT [FOO_KEY] FROM BAR
)
but I remember being told that:
When NULLs aren't a problem (and they're not here) there's hardly any difference between IN and EXISTS.
When using EXISTS, you don't need to use SELECT DISTINCT and there is no performance reason to do so.
This leaves me with good reason to believe that it is absolutely guaranteed that adding DISTINCT here will not make a difference. Is that correct?
From a functional point of view, the queries with and without DISTINCT are identical: they delete the same set of rows.
From a performance point of view, I would expect SQL Server to produce the same execution plan for both queries, since the optimizer can recognize that DISTINCT is redundant inside an IN subquery, but I cannot prove that it always does.
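Following the quoted advice, the NOT EXISTS form of the same delete (which would also behave sanely if BAR.[FOO_KEY] were nullable) looks like this:

```sql
-- NOT EXISTS needs no DISTINCT: it stops probing BAR at the first match.
DELETE FROM FOO
WHERE NOT EXISTS
(
    SELECT 1
    FROM BAR
    WHERE BAR.[FOO_KEY] = FOO.[FOO_KEY]
);
```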
For other database engines, this may be different. See:
https://mariadb.com/kb/en/optimizing-group-by/
https://www.quora.com/Should-I-use-DISTINCT-in-a-subquery-when-using-IN
https://docs.oracle.com/javadb/10.8.3.0/tuning/ctuntransform867165.html
I am re-iterating the question asked by Mongus Pong Why would using a temp table be faster than a nested query? which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity, it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. Often 3rd-party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are
The subquery or CTE may be being repeatedly re-evaluated.
Materialising partial results into a #temp table may force a more optimum join order for that part of the plan by removing some possible options from the equation.
Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see Provide a hint to force intermediate materialization of CTEs or derived tables. Using TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works, however, there are no statistics on the spool.
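As a sketch of that technique (table and column names are invented), the derived table is wrapped in TOP ... ORDER BY:

```sql
-- TOP with an ORDER BY can encourage SQL Server to compute the derived
-- table once into a spool instead of re-evaluating it per outer row.
SELECT b.bar_name, s.total
FROM
(
    SELECT TOP (2147483647) foo_id, SUM(amount) AS total
    FROM dbo.Foo
    GROUP BY foo_id
    ORDER BY foo_id
) AS s
JOIN dbo.Bar AS b ON b.foo_id = s.foo_id;
```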
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into that behaviour in some instances. This hint can yield a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal one. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. The function can then be used in a SELECT statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty smallint NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
CREATE VIEW SalesData As
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really a recommended one) is to force the subquery to be executed first by limiting its results with TOP. This could lead to strange problems later if the subquery's results ever exceed the limit (though you can set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose, since SQL Server simply ignores it.
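Applied to the query above, the hack would look something like this (the limit of one million is arbitrary and must stay well above any realistic row count):

```sql
SELECT bar.*
FROM
    -- TOP forces the subquery's filter to run before the join; if TableFoo
    -- ever holds more than 1000000 deleted rows, results silently break.
    (SELECT TOP 1000000 * FROM TableFoo WHERE Deleted = 1) foo
    JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());
```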
I've inherited some code which uses multiple tables to store the same information depending on how old it is (one for the current day, the last month, etc.).
Currently most of the code is duplicated for every condition, and I'd like to try and eliminate the majority of the duplication in the stored procedures. Right now re-architecting the design is not an option as there are a number of applications that depend on the current design that I have no control over.
One option I've tried so far is loading the needed data into a temp table which I found to have a rather large performance hit. I've also tried using a cte structured like this:
;WITH cte_table(...)
AS
(
SELECT ...
FROM a
WHERE @queried_date = CONVERT(DATE, GETDATE())
UNION ALL
SELECT ...
FROM b
WHERE @queried_date BETWEEN --some range
)
This works and the performance isn't terrible, but it's not very nice looking.
Could anyone offer a better alternative?
Two suggestions:
Just use UNION instead of UNION ALL: the UNION operator removes duplicates, while UNION ALL preserves them.
With the CTE, the outer/final SELECT can use a DISTINCT operator to bring back unique rows. That said, it's not clear why you'd need the CTE for this, since UNION should work just fine. (In fact, I believe SQL Server will optimize both forms to the same plan structure either way...)
Any way you slice it, if you have duplicate data, either you have to do something like the above, or you have to make explicit clauses that remove dupe cases, using things like #temp tables or WHERE ... NOT IN ().
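Concretely, the second suggestion applied to the asker's CTE (column names are placeholders, as is the second branch's condition):

```sql
;WITH cte_table (col1, col2) AS
(
    SELECT col1, col2 FROM a
    WHERE @queried_date = CONVERT(DATE, GETDATE())
    UNION ALL
    SELECT col1, col2 FROM b
    WHERE @queried_date < CONVERT(DATE, GETDATE())
)
-- DISTINCT on the outer SELECT removes duplicates across both branches,
-- equivalent to replacing UNION ALL with UNION inside the CTE.
SELECT DISTINCT col1, col2
FROM cte_table;
```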
I've got some SQL that looks roughly like this:
with InterestingObjects(ObjectID, OtherInformation, Whatever) as (
select X.ObjectID, Y.OtherInformation, Z.Whatever
from X join Y join Z -- join conditions omitted for brevity
)
-- ...long query follows, which uses InterestingObjects in several more CTEs,
-- and then uses those CTEs in a select statement at the end.
When I run it, I can see in the execution plan that it appears to be running the query in the CTE basically every single time the CTE is referenced. If I instead create a temp table #InterestingObjects and use it, of course, it runs the query once, puts the result in the temp table, and queries that from then on. In my particular instance, that makes the whole thing run much faster.
My question is: Is this always what I can expect from CTEs (not memoizing the results in any way, just as if it were inlining the query everywhere?) Is there a reason that SQL Server could not optimize this better? Usually I am in awe at how smart the optimizer is, but I'm surprised that it couldn't figure this out.
(edit: BTW, I'm running this on SQL Server '08 R2.)
CTEs can be better or worse depending on how they're used (recursion, indexing, and so on). You might find this article interesting: http://www.sqlservercentral.com/articles/T-SQL/2926/
Which of the following query is better... This is just an example, there are numerous situations, where I want the user name to be displayed instead of UserID
Select EmailDate, B.EmployeeName as [UserName], EmailSubject
from Trn_Misc_Email as A
inner join
Mst_Users as B on A.CreatedUserID = B.EmployeeLoginName
or
Select EmailDate, GetUserName(CreatedUserID) as [UserName], EmailSubject
from Trn_Misc_Email
If there is no performance benefit to the first, I would prefer the second... I'll have around 2,000 records in the user table and 100k records in the email table...
Thanks
A good question and great to be thinking about SQL performance, etc.
From a pure SQL point of view the first is better. The first statement does everything in a single batch with a join. In the second, for each row in Trn_Misc_Email it has to run a separate SELECT to get the user name. This could cause a performance issue now, or in the future.
It is also easier to read for anyone else coming onto the project, as they can see what is happening. With the second one you have to go and look inside the function (I'm guessing that's what it is) to find out what it is doing.
So in reality, two reasons to use the first approach.
The inline SQL JOIN will usually be better than the scalar UDF as it can be optimised better.
When testing it though be sure to use SQL Profiler to view the cost of both versions. SET STATISTICS IO ON doesn't report the cost for scalar UDFs in its figures which would make the scalar UDF version appear better than it actually is.
Scalar UDFs are very slow, but the inline ones are much faster, typically as fast as joins and subqueries
BTW, your query with the function call is equivalent to an outer join, not an inner one.
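A middle ground, sketched here with names assumed from the question, is an inline table-valued function: unlike a scalar UDF, its body is expanded into the calling query's plan, so it optimizes much like the JOIN version.

```sql
CREATE FUNCTION dbo.GetUserNameTvf (@CreatedUserID varchar(50))
RETURNS TABLE
AS
RETURN
(
    SELECT B.EmployeeName AS UserName
    FROM Mst_Users AS B
    WHERE B.EmployeeLoginName = @CreatedUserID
);
GO
-- OUTER APPLY keeps emails with no matching user, mirroring the
-- outer-join semantics of the scalar-function version.
SELECT E.EmailDate, U.UserName, E.EmailSubject
FROM Trn_Misc_Email AS E
OUTER APPLY dbo.GetUserNameTvf(E.CreatedUserID) AS U;
```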
One more tip: in SQL Server Management Studio you can evaluate performance with Display Estimated Execution Plan. It shows how the indexes and joins are used, so you can choose the best approach.
You can also use the Database Engine Tuning Advisor (DTA) for more information and optimization suggestions.