Does SELECT DISTINCT differ from SELECT when using a NOT IN clause? - sql-server

My
DELETE FROM FOO
WHERE [FOO_KEY] NOT IN
(
SELECT [FOO_KEY] FROM BAR
)
query is running shockingly slow. I know that BAR is a very big table, so I'm tempted to write
DELETE FROM FOO
WHERE [FOO_KEY] NOT IN
(
SELECT DISTINCT [FOO_KEY] FROM BAR
)
but I remember being told that:
When NULLs aren't a problem (and they're not here) there's hardly any difference between IN and EXISTS.
When using EXISTS, you don't need to use SELECT DISTINCT and there is no performance reason to do so.
This leaves me with good reason to believe that it is absolutely guaranteed that adding DISTINCT here will not make a difference. Is that correct?

From a functional point of view, the queries with and without DISTINCT are identical (they delete the same set of rows).
From a performance point of view, I expect SQL Server to produce the same execution plan for both queries, although I cannot prove that this always holds.
For other database engines, this may be different. See:
https://mariadb.com/kb/en/optimizing-group-by/
https://www.quora.com/Should-I-use-DISTINCT-in-a-subquery-when-using-IN
https://docs.oracle.com/javadb/10.8.3.0/tuning/ctuntransform867165.html
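For comparison, here is a minimal sketch of the NOT EXISTS form that the question alludes to. It assumes, as the question states, that [FOO_KEY] is never NULL in either table; the aliases f and b are illustrative only.

```sql
-- Assumes FOO_KEY is NOT NULL in both FOO and BAR; under that assumption
-- this deletes the same rows as the NOT IN version.
DELETE f
FROM FOO AS f
WHERE NOT EXISTS
(
    SELECT 1
    FROM BAR AS b
    WHERE b.[FOO_KEY] = f.[FOO_KEY]
);
```

Comparing the actual execution plans of the NOT IN and NOT EXISTS forms is the way to check whether they differ on a given server.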

Related

SQL Server : Tables vs Cursors

I'm asking for a high level understanding of what these two things are.
From what I've read, it seems that in general, a query with an ORDER BY clause returns a cursor, and basically cursors have order to them whereas tables are literally a set where order is not guaranteed.
What I don't really understand is, why are these two things talked about like two separate animals. To me, it seems like cursors are a subset of tables. The book I'm reading vaguely mentioned that
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
My question would be... why not? Why won't SQL handle it like a table anyways even if it's given an ordered set?
Just to clarify, I will type out the paragraph from the book:
"A query with an ORDER BY clause results in what standard SQL calls a cursor - a nonrelational result with order guaranteed among rows. You're probably wondering why it matters whether a query returns a table result or a cursor. Some language elements and operations in SQL expect to work with table results of queries and not with cursors; examples include table expressions and set operators..."
A table is a result set. It has columns and rows. You can join to it with other tables to either filter or combine the data in ONE operation:
SELECT *
FROM TABLE1 T1
JOIN TABLE2 T2
ON T1.PK = T2.PK
A cursor is a variable that stores a result set. It has columns, but the rows are inaccessible - except the top one! You can't access the records directly, rather you must fetch them ONE ROW AT A TIME.
DECLARE TESTCURSOR CURSOR
FOR SELECT * FROM Table1;
OPEN TESTCURSOR;
FETCH NEXT FROM TESTCURSOR;
-- Release the cursor when finished.
CLOSE TESTCURSOR;
DEALLOCATE TESTCURSOR;
You can also fetch them into variables, if needed, for more advanced processing.
Please let me know if that doesn't clarify it for you.
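To illustrate fetching into variables, here is a minimal sketch of the usual row-by-row loop. It assumes Table1 has an integer column named PK (borrowed from the join example in this answer); the cursor and variable names are made up for illustration.

```sql
-- Assumption: Table1 has an int column named PK.
DECLARE @pk int;

DECLARE TestCursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT PK FROM Table1 ORDER BY PK;

OPEN TestCursor;
FETCH NEXT FROM TestCursor INTO @pk;

-- @@FETCH_STATUS is 0 for as long as the last FETCH returned a row.
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @pk;                          -- process one row at a time
    FETCH NEXT FROM TestCursor INTO @pk;
END;

CLOSE TestCursor;
DEALLOCATE TestCursor;
```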
With regard to this sentence,
"Some language elements and operations in SQL expect to work with
table results of queries and not with cursors; examples include table
expressions and set operators"
I think the author is just saying that there are cases where it doesn't make sense to use an ORDER BY in a fragment of a query, because the ORDER BY should be on the outer query, where it will actually affect the final result of the query.
For instance, I can't think of any point in putting an ORDER BY on a CTE ("table expression") or on the Subquery in an IN( ) expression. UNLESS (in both cases) a TOP n was used as well.
When you create a VIEW, SQL Server will actually not allow you to use an ORDER BY unless a TOP n is also used. Otherwise the ORDER BY should be specified when Selecting from the VIEW, not in the code of the VIEW itself.
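A short sketch of that last point, using a hypothetical table dbo.Orders(OrderId, OrderDate) that is not part of the question: the view stays unordered, and the query that reads from it asks for the order.

```sql
-- Hypothetical table and view names, for illustration only.
-- The view itself carries no ORDER BY ...
CREATE VIEW dbo.OrdersView AS
SELECT OrderId, OrderDate
FROM dbo.Orders;
GO

-- ... the ordering is requested by the query that selects from the view.
SELECT OrderId, OrderDate
FROM dbo.OrdersView
ORDER BY OrderDate DESC;
```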

How can I force a subquery to perform as well as a #temp table?

I am reiterating the question asked by Mongus Pong, "Why would using a temp table be faster than a nested query?", which doesn't have an answer that works for me.
Most of us at some point find that when a nested query reaches a certain complexity it needs to be broken into temp tables to keep it performant. It is absurd that this could ever be the most practical way forward, and it means these processes can no longer be made into a view. And often 3rd party BI apps will only play nicely with views, so this is crucial.
I am convinced there must be a simple queryplan setting to make the engine just spool each subquery in turn, working from the inside out. No second guessing how it can make the subquery more selective (which it sometimes does very successfully) and no possibility of correlated subqueries. Just the stack of data the programmer intended to be returned by the self-contained code between the brackets.
It is common for me to find that simply changing from a subquery to a #table takes the time from 120 seconds to 5. Essentially the optimiser is making a major mistake somewhere. Sure, there may be very time consuming ways I could coax the optimiser to look at tables in the right order but even this offers no guarantees. I'm not asking for the ideal 2 second execute time here, just the speed that temp tabling offers me within the flexibility of a view.
I've never posted on here before but I have been writing SQL for years and have read the comments of other experienced people who've also just come to accept this problem and now I would just like the appropriate genius to step forward and say the special hint is X...
There are a few possible explanations as to why you see this behavior. Some common ones are:
1. The subquery or CTE may be repeatedly re-evaluated.
2. Materialising partial results into a #temp table may force a better join order for that part of the plan by removing some possible options from the equation.
3. Materialising partial results into a #temp table may improve the rest of the plan by correcting poor cardinality estimates.
The most reliable method is simply to use a #temp table and materialize it yourself.
Failing that, regarding point 1, see "Provide a hint to force intermediate materialization of CTEs or derived tables". The use of TOP(large_number) ... ORDER BY can often encourage the result to be spooled rather than repeatedly re-evaluated.
Even if that works however there are no statistics on the spool.
For points 2 and 3 you would need to analyse why you weren't getting the desired plan. Possibly rewriting the query to use sargable predicates, or updating statistics might get a better plan. Failing that you could try using query hints to get the desired plan.
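As a rough sketch of materialising the intermediate result yourself: the table and column names below (TableFoo, TableBar, Id, FooId, Deleted, CreatedOn) and the index name are placeholders, not anything from the question.

```sql
-- Placeholder names throughout; adapt to the real schema.
-- Step 1: materialise the intermediate result once.
SELECT Id, CreatedOn
INTO #foo
FROM TableFoo
WHERE Deleted = 1;

-- Optional: index the temp table. Unlike a spool, a temp table also gets statistics.
CREATE CLUSTERED INDEX IX_foo_Id ON #foo (Id);

-- Step 2: run the rest of the query against the materialised rows.
SELECT bar.*
FROM #foo AS foo
JOIN TableBar AS bar ON bar.FooId = foo.Id
WHERE foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());

DROP TABLE #foo;
```

This cannot be wrapped in a view, which is exactly the limitation the question complains about.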
I do not believe there is a query hint that instructs the engine to spool each subquery in turn.
There is the OPTION (FORCE ORDER) query hint, which forces the engine to perform the JOINs in the order specified and could potentially coax it into achieving that result in some instances. This hint will sometimes result in a more efficient plan for a complex query when the engine keeps insisting on a sub-optimal plan. Of course, the optimizer should usually be trusted to determine the best plan.
Ideally there would be a query hint that would allow you to designate a CTE or subquery as "materialized" or "anonymous temp table", but there is not.
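For reference, a minimal sketch of where the FORCE ORDER hint goes. The tables and columns (TableFoo, TableBar, Id, FooId, Deleted) are placeholders; whether the hint helps depends entirely on the query.

```sql
-- Placeholder table and column names.
-- FORCE ORDER makes the optimizer join the tables in the order they are written:
-- TableFoo first, then TableBar.
SELECT bar.*
FROM TableFoo AS foo
JOIN TableBar AS bar ON bar.FooId = foo.Id
WHERE foo.Deleted = 1
OPTION (FORCE ORDER);
```

Note that a query hint like this cannot be placed inside a view definition; it has to be on the statement that queries the view.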
Another option (for future readers of this article) is to use a user-defined function. Multi-statement functions (as described in How to Share Data between Stored Procedures) appear to force the SQL Server to materialize the results of your subquery. In addition, they allow you to specify primary keys and indexes on the resulting table to help the query optimizer. This function can then be used in a select statement as part of your view. For example:
CREATE FUNCTION SalesByStore (@storeid varchar(30))
RETURNS @t TABLE (title varchar(80) NOT NULL PRIMARY KEY,
                  qty   smallint    NOT NULL) AS
BEGIN
    INSERT @t (title, qty)
    SELECT t.title, s.qty
    FROM sales s
    JOIN titles t ON t.title_id = s.title_id
    WHERE s.stor_id = @storeid
    RETURN
END
GO
CREATE VIEW SalesData AS
SELECT * FROM SalesByStore('6380')
Having run into this problem, I found out that (in my case) SQL Server was evaluating the conditions in incorrect order, because I had an index that could be used (IDX_CreatedOn on TableFoo).
SELECT bar.*
FROM
(SELECT * FROM TableFoo WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
I managed to work around it by forcing the subquery to use another index (i.e. one that would be used when the subquery was executed without the parent query). In my case I switched to PK, which was meaningless for the query, but allowed the conditions from the subquery to be evaluated first.
SELECT bar.*
FROM
(SELECT * FROM TableFoo WITH (INDEX([PK_Id])) WHERE Deleted = 1) foo
JOIN TableBar bar ON (bar.FooId = foo.Id)
WHERE
foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE())
Filtering by the Deleted column was really simple and filtering the few results by CreatedOn afterwards was even easier. I was able to figure it out by comparing the Actual Execution Plan of the subquery and the parent query.
A more hacky solution (and not really recommended) is to force the subquery to get executed first by limiting the results using TOP, however this could lead to weird problems in the future if the results of the subquery exceed the limit (you could always set the limit to something ridiculous). Unfortunately TOP 100 PERCENT can't be used for this purpose since SQL Server just ignores it.
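A sketch of that TOP workaround applied to the query above, using the largest int value as the "ridiculous" limit and ordering by Id (an arbitrary choice). It only encourages the optimizer to evaluate the subquery first and guarantees nothing.

```sql
-- TOP with a huge limit plus ORDER BY; a hack, not a documented guarantee.
SELECT bar.*
FROM
    (SELECT TOP (2147483647) *
     FROM TableFoo
     WHERE Deleted = 1
     ORDER BY Id) AS foo
JOIN TableBar AS bar ON bar.FooId = foo.Id
WHERE
    foo.CreatedOn > DATEADD(DAY, -7, GETUTCDATE());
```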

Eliminating code duplication when querying multiple tables with the same schema

I've inherited some code which uses multiple tables to store the same information depending on how old it is (one for the current day, the last month, etc.).
Currently most of the code is duplicated for every condition, and I'd like to try and eliminate the majority of the duplication in the stored procedures. Right now re-architecting the design is not an option as there are a number of applications that depend on the current design that I have no control over.
One option I've tried so far is loading the needed data into a temp table which I found to have a rather large performance hit. I've also tried using a cte structured like this:
;WITH cte_table(...)
AS
(
SELECT ...
FROM a
WHERE @queried_date = CONVERT(DATE, GETDATE())
UNION ALL
SELECT ...
FROM b
WHERE @queried_date BETWEEN --some range
)
This works and the performance isn't terrible, but it's not very nice looking.
Could anyone offer a better alternative?
Two suggestions:
Just use UNION, not UNION ALL. The UNION operator removes duplicates in that case. UNION ALL preserves dupes.
Using the CTE, the SELECT clause on the outside / end can have a DISTINCT operator to bring back unique rows. Of course, I'm not sure why you'd be using a CTE in this scenario since UNION should work just fine. (In fact, I believe SQL Server will optimize the query to the same plan structure either way...)
Any way you slice it, if you have duplicate data, either you have to do something like the above, or you have to make explicit clauses that remove dupe cases, using things like #temp tables or WHERE ... NOT IN ().
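To make the difference between the two operators concrete, here is a tiny self-contained sketch using inline VALUES lists in place of real tables (the aliases a, b and the column v are made up).

```sql
-- UNION removes duplicate rows: returns 1, 2, 3 (three rows, in no guaranteed order).
SELECT v FROM (VALUES (1), (2)) AS a(v)
UNION
SELECT v FROM (VALUES (2), (3)) AS b(v);

-- UNION ALL keeps them: returns 1, 2, 2, 3 (four rows, in no guaranteed order).
SELECT v FROM (VALUES (1), (2)) AS a(v)
UNION ALL
SELECT v FROM (VALUES (2), (3)) AS b(v);
```

UNION has to do extra work to remove the duplicates, so UNION ALL is normally the cheaper choice when the branches cannot overlap.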

Performance characteristics of T-SQL CTEs

I've got some SQL that looks roughly like this:
with InterestingObjects(ObjectID, OtherInformation, Whatever) as (
select X.ObjectID, Y.OtherInformation, Z.Whatever
from X join Y join Z -- abbreviated for brevity
)
-- ...long query follows, which uses InterestingObjects in several more CTEs,
-- and then uses those CTEs in a select statement at the end.
When I run it, I can see in the execution plan that it appears to be running the query in the CTE basically every single time the CTE is referenced. If I instead create a temp table #InterestingObjects and use it, of course, it runs the query once, puts the result in the temp table, and queries that from then on. In my particular instance, that makes the whole thing run much faster.
My question is: Is this always what I can expect from CTEs (not memoizing the results in any way, just as if it were inlining the query everywhere?) Is there a reason that SQL Server could not optimize this better? Usually I am in awe at how smart the optimizer is, but I'm surprised that it couldn't figure this out.
(edit: BTW, I'm running this on SQL Server '08 R2.)
CTEs can be better or worse, depending on how they're used (involving concepts of recursion, indexing, etc.). You might find this article interesting: http://www.sqlservercentral.com/articles/T-SQL/2926/

Use of With Clause in SQL Server

How does the WITH clause work in SQL Server? Does it really give me some performance boost, or does it just help to make scripts more readable?
When is it right to use it? What should you know about the WITH clause before you start to use it?
Here's an example of what I'm talking about:
http://www.dotnetspider.com/resources/33984-Use-With-Clause-Sql-Server.aspx
I'm not entirely sure about performance advantages, but I think it can definitely help in the case where using a subquery results in the subquery being performed multiple times.
Apart from that it can definitely make code more readable, and can also be used in the case where multiple subqueries would be a cut and paste of the same code in different places.
What should you know before you use it?
A big downside is that when you have a CTE in a view, you cannot create a clustered index on that view. This can be a big pain because indexed views are the only form of materialised view SQL Server has, and it has certainly bitten me before.
Unless you use recursive abilities, a CTE is not better performance-wise than a simple inline view.
It just saves you some typing.
The optimizer is free to decide whether to reevaluate it or not when it's being reused, and in most cases it decides to reevaluate:
WITH q (uuid) AS
(
SELECT NEWID()
)
SELECT *
FROM q
UNION ALL
SELECT *
FROM q
will return you two different NEWIDs.
Note that other engines may behave differently.
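PostgreSQL, unlike SQL Server, always materialized CTEs before version 12; from version 12 onwards it may inline a non-recursive, side-effect-free CTE that is referenced only once, unless the CTE is declared AS MATERIALIZED.
Oracle supports a special hint, /*+ MATERIALIZE */, that asks the optimizer to materialize the CTE.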
WITH is a keyword in SQL that gives a name to a temporary result set, which you can then query like a temporary table for the duration of the statement. Example:
WITH a      -- a is the name of the temporary result set
(id)        -- id is the column name exposed by a
AS (SELECT column_name FROM table_name)
SELECT * FROM a

Resources