Using 'WHERE then UNION' or 'UNION then WHERE' - sql-server

Please keep these two types of query in mind:
--query1
Select someFields
From someTables
Where someWhereClues
Union all
Select someFields
FROM someTables
Where someWhereClues
--query2
Select * FROM (
Select someFields
From someTables
Union all
Select someFields
FROM someTables
) DT
Where someMixedWhereClues
Note:
In both queries the final result fields are the same.
I thought the first query would be faster or perform better!
But after some research I was confused by this note:
SQL Server (as an example of an RDBMS) first reads the whole data and then seeks the records => so in both queries all records will be read and seeked.
Please help me with my misunderstanding, and with whether there are any other differences between query1 and query2?
Edit: adding sample plans:
select t.Name, t.type from sys.tables t where t.type = 'U'
union all
select t.Name, t.type from sys.objects t where t.type = 'U'
select * from (
select t.Name, t.type from sys.tables t
union all
select t.Name, t.type from sys.objects t
) dt
where dt.type = 'U'
Execution Plans are:
Both are the same, at 50% each.

The SQL Server query optimizer optimizes both queries, so you get nearly the same performance.

The first one cannot be slower. Here is the reasoning:
If the WHERE clauses in the first can efficiently use an INDEX, there will be fewer rows to collect together in the UNION. Fewer rows --> faster.
The second one does not have an INDEX on the UNION, hence the WHERE cannot be optimized in that way.
Here are things that could lead to the first being slower. But I see them as exceptions, not the rule.
A different amount of parallelism is achieved.
Different stuff happens to be cached at the time you run the queries.
Caveat: I am assuming all three WHERE clauses are identical (as your example shows).
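To make that concrete, here is a minimal sketch; the table, column, and index names are hypothetical, not from the question:
-- Hypothetical table and index, for illustration only.
CREATE TABLE dbo.Orders (OrderID int PRIMARY KEY, Status char(1) NOT NULL, Total money);
CREATE INDEX IX_Orders_Status ON dbo.Orders (Status);
-- Form 1: each branch can seek IX_Orders_Status and fetch only matching rows.
SELECT OrderID FROM dbo.Orders WHERE Status = 'U'
UNION ALL
SELECT OrderID FROM dbo.Orders WHERE Status = 'U';
-- Form 2: the optimizer has to push the predicate into both branches itself;
-- if it fails to, all rows are concatenated first and filtered afterwards.
SELECT OrderID FROM (
SELECT OrderID, Status FROM dbo.Orders
UNION ALL
SELECT OrderID, Status FROM dbo.Orders
) dt
WHERE dt.Status = 'U';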

As a rule of thumb, I will always consider using the first type of query.
In made-up samples and queries with simple WHERE predicates, both will use the same plan. But in a more complex query with more complicated predicates, the optimizer might not come up with an equally efficient plan for the second type of query (it's just an optimizer, bound by resource and time constraints). The more complex the query, the less chance the optimizer finds the best execution plan (it will eventually time out and choose the least-worst plan found so far). And it gets even worse if the predicates are ORed.

SQL Server will optimize both of those queries down to the same thing, as shown in the execution plans you posted. It can do this because these queries are fairly simple; in other cases it may turn out differently. When composing a query, you should try to follow the same general rules the optimizer does and filter as early as possible to limit the result set. By telling it that you first want only the 'U' records and then want those results combined, you keep the query robust to later revisions that might otherwise invalidate the optimizer's choices that produced the identical plans here.
In short, you don't have to force simple queries to be optimal, but it's a good habit to have, and it will help when creating more complex queries.

In my practice the first option was never slower than the second. I think the optimizer is smart enough to handle these plans more or less in the same manner. However, I ran some tests and the first option was always better. For example:
CREATE TABLE #a ( a INT, b INT );
WITH Numbers ( I ) AS (
SELECT 1000
UNION ALL
SELECT I + 1
FROM Numbers
WHERE I < 5000
)
INSERT INTO #a ( a )
SELECT I
FROM Numbers
ORDER BY CRYPT_GEN_RANDOM(4)
OPTION ( MAXRECURSION 0 );
WITH Numbers ( I ) AS (
SELECT 1000
UNION ALL
SELECT I + 1
FROM Numbers
WHERE I < 5000
)
INSERT INTO #a ( b )
SELECT I
FROM Numbers
ORDER BY CRYPT_GEN_RANDOM(4)
OPTION ( MAXRECURSION 0 );
SELECT a, b
FROM #a
WHERE a IS NOT NULL
UNION ALL
SELECT a, b
FROM #a
WHERE b IS NOT NULL
SELECT *
FROM (
SELECT a, b
FROM #a
UNION ALL
SELECT a, b
FROM #a
) c
WHERE a IS NOT NULL
OR b IS NOT NULL
The result is 47% vs 53%, in favor of the first query.

In my experience, there is no straightforward answer to this; it varies with the nature of the underlying query. As you have shown, the optimizer comes up with the same execution plan in both of those scenarios, but that is not always the case. The performance is usually similar, yet sometimes it can vary drastically depending on the query. In general I only take a closer look when performance is bad for no good reason.

Related

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE a = '1/1/2020'
However, if I join with another small table in the final SELECT statement, it is extremely slow (> 30 sec):
DECLARE @anotherTable TABLE (A DATE, B INT)
INSERT INTO @anotherTable (A, B)
VALUES ('1/1/2020', 1)
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN @anotherTable T ON T.A = D.A AND T.B = D.B
In the real application I have a complex UPDATE as the final statement, so I'm trying to avoid copy-paste and introduced the UNION to consolidate code.
But now I'm experiencing this unexpected slowness.
I tried using UNION ALL instead of UNION, with the same result.
It looks like SQL Server pushes simple conditions down into each branch of the UNION, but when I join with another table that doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here are the estimated plans:
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that the estimated plans are not for exactly the query above but for a more realistic one that has exactly the same problem.
When I'm doing complex updates I normally declare a temp table and insert the rows into it that I intend to update. There are two benefits to this approach: one is that by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple (just update the rows whose primary key is in your temp table). The other big benefit is that you can do some sanity checking before actually running your update, and "throw an error" by returning a different value.
I think it's generally a good practice to break down queries into simple steps like this, because it makes them much easier to troubleshoot in the future.
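A minimal sketch of that pattern, with hypothetical table, column, and key names (none of them from the question):
-- 1. Explicitly collect the keys of the rows to be updated.
SELECT t.TargetId
INTO #toUpdate
FROM dbo.TargetTable AS t
JOIN dbo.SourceTable AS s
ON s.Key1 = t.Key1 AND s.Key2 = t.Key2;
-- 2. Sanity-check before touching anything (inside a stored procedure,
-- so RETURN aborts the rest of the batch).
IF (SELECT COUNT(*) FROM #toUpdate) > 10000
BEGIN
RAISERROR('Unexpectedly large update; aborting.', 16, 1);
RETURN;
END;
-- 3. The update itself stays trivial: match on the primary key.
UPDATE t
SET t.SomeColumn = 'NewValue'
FROM dbo.TargetTable AS t
JOIN #toUpdate AS u ON u.TargetId = t.TargetId;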
Based on the "similar" execution plan you shared: it would also be better to have the actual plan, to know whether your estimates and memory grants are OK.
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the following columns: nav, equity.
Why? For every row the index returns, a key lookup against the clustered index is needed to retrieve the values of the nav and equity columns.
Do this only if it is reasonable for the application, and if other queries may benefit as well.
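A sketch of that change; only the index name comes from the shared plan, so the table and key columns here are assumptions:
-- Recreate the index with nav and equity as included columns, so the
-- per-row key lookup against the clustered index disappears.
CREATE NONCLUSTERED INDEX IX_dperf_date_fund
ON dbo.dperf (AsOfDate, FundId) -- assumed table and key columns
INCLUDE (nav, equity)
WITH (DROP_EXISTING = ON);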
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM
ETL.tblIncomingData
UNION ALL
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM ETL.vIncomingDataDPerf
) x
Why? CTEs are not materialized, but a temp table is (see this answer).
Bonus: parameter sniffing
If you pass in parameters, you might be suffering from parameter sniffing.
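A quick way to test that, as a sketch (the joined table is illustrative; OPTION (RECOMPILE) compiles a plan for the actual runtime values instead of reusing a sniffed one):
SELECT *
FROM #JointIncomingData d
JOIN dbo.AnotherTable t -- hypothetical stand-in for the joined table
ON t.AsOfDate = d.AsOfDate AND t.FundId = d.FundId
OPTION (RECOMPILE); -- fresh plan for the current parameter values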

Why is a join ON FALSE much slower than a join with an ON condition that refers to the columns but always evaluates to false in Snowflake?

These two queries have the same result but very different execution times; in both cases the ON clause always evaluates to false. In the first query there is an explicit ON FALSE, and in the second query ON t1.c1 = t2.c2 always evaluates to false as well.
-- query 1
with t1 as (
select seq4()*2 as c1 from table(generator(rowcount => 1000000))
)
,t2 as (
select (seq4()*2)+1 as c2 from table(generator(rowcount => 1000000))
)
select * from t1 FULL JOIN t2 ON false; -- takes 16 minutes on a small warehouse
--query 2
with t1 as (
select seq4()*2 as c1 from table(generator(rowcount => 1000000))
)
,t2 as (
select (seq4()*2)+1 as c2 from table(generator(rowcount => 1000000))
)
select * from t1 FULL JOIN t2 ON t1.c1 = t2.c2 -- Instantaneous , same results
;
According to the snowflake profiler the only difference is that in query 1 we get a FULL OUTER join node with Additional Join Condition 1=0 and in query 2 we get a FULL OUTER join node with Equality Join Condition SYS_VW.C1_0 = SYS_VW.C2_0.
I guess that query 1 is really doing a CROSS JOIN first (1000000 * 1000000 = 10^12 rows) and then filtering that, while query 2 is just doing a UNION of sorts, evaluating only 1000000 + 1000000 = 2M rows.
But the question is: why? Is this behaviour specified/required by SQL in general, or is it just a miss by the Snowflake query planner/optimizer?
After consulting with Snowflake support (case 96930), I got these takeaway points:
ON FALSE is not a syntax they support right now.
Snowflake doesn't interpret the keyword FALSE like other SQL dialects, and it's reserved for use outside of JOIN clauses. That is why it gets translated to 1=0.
Since it's not supported, query 1 results in 56 more optimization steps than query 2.
Fortunately, the support engineer agrees that there is an opportunity to handle these conditions (ON FALSE, ON t1.c1 = t2.c2) exactly the same, and he will deliver the suggestion to the engineering department.
So, in short, ON FALSE is not supported, although it doesn't produce an error. They recommend using a proper ON clause for all JOINs except CROSS JOIN. Maybe in the future they will recognize ON FALSE and optimize it away.
This does seem to be a miss by Snowflake's query planner. I don't see any docs specific to this example, though. Off the top of my head, based on experience with other DBMSs, my theory is that it has to do with sargability.
A quick Google search on "sargable" will do more for you than an answer here. But in short, predicates that can take advantage of an index are considered sargable. Most DBMSs have situations where a predicate obviously does not interfere with an index (your case is a good example), but the optimizer won't have that specific situation coded for, and will then decide: "well, I'm not sure whether an index can still be used, so I'll assume it can't and do this the long way around".
So I'm wondering if something similar is happening here, since Snowflake does some different things under the hood for optimizing and "indexing" than most other systems. In your example I would guess that in case 2 it's able to determine that it has two sorted lists of numbers and just has to run through the two lists in order, whereas in the first case it decides: "I have two sorted lists of numbers, but that isn't relevant to my join predicate... better compare each row to every other row and check the predicate each time".
I'd recommend sending this in to Snowflake in a support ticket.
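For background, a generic illustration of the sargability idea mentioned above, written as T-SQL against a hypothetical table (the general principle, not Snowflake specifics):
-- Not sargable: the function wrapped around the column hides it from any
-- index, so the engine falls back to evaluating the predicate row by row.
SELECT * FROM dbo.Sales WHERE YEAR(SaleDate) = 2020;
-- Sargable: the column stands alone, so an index on SaleDate can be seeked.
SELECT * FROM dbo.Sales
WHERE SaleDate >= '2020-01-01' AND SaleDate < '2021-01-01';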

SQL Server and intermediate materialization?

After reading this interesting article about intermediate materialization, I still have some questions.
I have this query:
SELECT *
FROM ...
WHERE isnumeric(MyCol)=1 and ( CAST( MyCol AS int)>1)
However, the evaluation order within the WHERE clause is not deterministic, so I might get an exception here (if it tries to cast "k1k1" first).
I assume this will solve the problem:
SELECT MyCol
FROM
(SELECT TOP 100 PERCENT MyCol FROM MyTable WHERE ISNUMERIC(MyCol) = 1 ORDER BY MyCol) bar
WHERE
CAST(MyCol AS int) > 100
Why does adding TOP 100 PERCENT + ORDER BY change anything vs. my regular query?
I read in the comments :
(the "intermediate" result -- in other words, a result obtained during
the process, that will be used to calculate the final result) will be
physically stored ("materialized") in TempDB and used from there for
the remainder of the user, instead of being queried back from the base
tables.
What difference does it make if it is stored in tempdb or queried back from the base tables? It is the same data!
The supported way to avoid errors due to the optimizer reorganizing things is to use CASE:
SELECT *
FROM YourTable
WHERE
1 <=
CASE
WHEN aa NOT LIKE '%[^0-9]%'
THEN CONVERT(int, aa)
ELSE 0
END;
Intermediate materialization is not a supported technique, so it should only be employed by very expert users in special circumstances where the risks are understood and accepted.
TOP 100 PERCENT is generally ignored by the optimizer in SQL Server 2005 onward.
By adding the TOP clause into the inner query, you're forcing SQL Server to run that query first before it runs the outer query - thereby discarding all rows for which ISNUMERIC returns false.
Without the TOP clause, the optimiser can rewrite the query to be the same as your first query.
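As an aside not mentioned in the answers above: on SQL Server 2012 and later, TRY_CAST sidesteps the evaluation-order problem entirely, because it returns NULL instead of raising an error for non-numeric values:
SELECT *
FROM YourTable
WHERE TRY_CAST(MyCol AS int) > 100; -- NULL for 'k1k1', and NULL > 100 is not true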

Why is this non-correlated query so slow?

I have this query...
SELECT Distinct([TargetAttributeID]) FROM
(SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x
Execution Plan for the above query
The two inner distincts are looking at 32 and 10,000 rows respectively. This query returns 5 rows and executes in under 1 second.
If I then use the result of this query as the subject of an IN like so...
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT Distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT Distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
Then it takes over 3 minutes! If I just take the result of the query and perform the IN "manually" then again it comes back extremely quickly.
However if I remove the two inner DISTINCTS....
SELECT attx.intAttributeID,attx.txtAttributeName,attx.txtAttributeLabel,attx.txtType,attx.txtEntity FROM
AST_tblAttributes attx WHERE attx.intAttributeID
IN
(SELECT Distinct([TargetAttributeID]) FROM
(SELECT att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
union all
SELECT ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate) x)
Execution Plan for the above query
..then it comes back in under a second.
What is SQL Server thinking? Can it not figure out that it can perform the two sub-queries and use the result as the subject of the IN? It seems as slow as a correlated sub-query, but it isn't correlated!
In the estimated execution plan there are three Clustered Index Scans, each with a cost of 100%! (Execution Plan is here)
Can anyone tell me why the inner DISTINCTS make this query so much slower (but only when used as the subject of an IN...) ?
UPDATE
Sorry it's taken me a while to get these execution plans up...
Query 1
Query 2 (The slow one)
Query 3 - No Inner Distincts
Honestly I think it comes down to the fact that, in terms of relational operators, you have a gratuitously baroque query there, and SQL Server stops searching for alternate execution plans within the time it allows itself to find one.
After the parse and bind phase of plan compilation, SQL Server will apply logical transforms to the resulting tree, estimate the cost of each, and choose the one with the lowest cost. It doesn't exhaust all possible transformations, just as many as it can compute within a given window. So presumably, it has burned through that window before it arrives at a good plan, and it's the addition of the outer semi-self-join on AST_tblAttributes that pushed it over the edge.
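If you want to corroborate that, the optimizer records an early abort in the plan itself; here is a sketch for spotting it in the plan cache (the DMVs and the StatementOptmEarlyAbortReason attribute are real; adapt the filtering to your own workload):
-- A StatementOptmEarlyAbortReason of "TimeOut" on the StmtSimple element
-- means the optimizer stopped searching for alternative plans early.
SELECT qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
WHERE qp.query_plan.exist('declare namespace p="http://schemas.microsoft.com/sqlserver/2004/07/showplan";
//p:StmtSimple[@StatementOptmEarlyAbortReason="TimeOut"]') = 1;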
How is it gratuitously baroque? Well, first off, there's this (simplified for noise reduction):
select distinct intAttributeID from (
select distinct intAttributeID from AST_tblAttributes ....
union all
select distinct intAttributeID from AST_tblAttributes ....
) x
Concatenating two sets, and projecting the unique elements? Turns out there's an operator for that: it's called UNION. So given enough time during plan compilation and enough logical transformations, SQL Server will realize what you really mean is:
select intAttributeID from AST_tblAttributes ....
union
select intAttributeID from AST_tblAttributes ....
But wait, you put this in an IN subquery. Well, IN is a semi-join, and the right relation of a semi-join does not require logical dedupping. So SQL Server may logically rewrite the query as this:
select * from AST_tblAttributes
where intAttributeID in (
select intAttributeID from AST_tblAttributes ....
union all
select intAttributeID from AST_tblAttributes ....
)
And then go about physical plan selection. But to get there, it has to see through the cruft first, and that may fall outside the optimization window.
EDIT:
Really, the way to explore this for yourself, and corroborate the speculation above, is to put both versions of the query in the same window and compare estimated execution plans side-by-side (Ctrl-L in SSMS). Leave one as is, edit the other, and see what changes.
You will see that some alternate forms are recognized as logically equivalent and generate the same good plan, while others generate less optimal plans, as you bork the optimizer.**
Then, you can use SET STATISTICS IO ON and SET STATISTICS TIME ON to observe the actual amount of work SQL Server performs to execute the queries:
SET STATISTICS IO ON
SET STATISTICS TIME ON
SELECT ....
SELECT ....
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
The output will appear in the messages pane.
** Or not--if they all generate the same plan, but actual execution time still varies like you say, something else may be going on--it's not unheard of. Try comparing actual execution plans and go from there.
El Ronnoco,
First of all, a possible explanation:
You say: "This query returns 5 rows and executes in under 1 second." But how many rows does it ESTIMATE are returned? If the estimate is very far off, using the query as the IN part could cause you to scan the entire AST_tblAttributes in the outer part instead of index seeking it (which could explain the big difference).
If you shared the query plans for the different variants (as a file, please), I think I would be able to give you an idea of what is going on under the hood here. It would also allow us to validate the explanation.
Edit: each DISTINCT keyword adds a new Sort node to your query plan. Basically, by having those other DISTINCTs in there, you're forcing SQL Server to re-sort the entire table again and again to make sure that it isn't returning duplicates. Each such operation can quadruple the cost of the query. Here's a good review of the effects that the DISTINCT operator can have, intended and unintended. I've been bitten by this myself.
Are you using SQL 2008? If so, you can try this, putting the DISTINCT work into a CTE and then joining to your main table. I've found CTEs to be pretty fast:
WITH DistinctAttribID
AS
(
SELECT Distinct([TargetAttributeID])
FROM (
SELECT distinct att1.intAttributeID as [TargetAttributeID]
FROM AST_tblAttributes att1
INNER JOIN
AST_lnkProfileDemandAttributes pda
ON pda.intAttributeID=att1.intAttributeID AND pda.intProfileID = @intProfileID
UNION ALL
SELECT distinct ca2.intAttributeID as [TargetAttributeID] FROM
AST_lnkCapturePolicyAttributes ca2
INNER JOIN
AST_lnkEmployeeCapture ec2 ON ec2.intAdminCaptureID = ca2.intAdminCaptureID AND ec2.intTeamID = 57
WHERE ec2.dteCreatedDate >= @cutoffdate
) x
)
SELECT attx.intAttributeID,
attx.txtAttributeName,
attx.txtAttributeLabel,
attx.txtType,
attx.txtEntity
FROM AST_tblAttributes attx
JOIN DistinctAttribID attrib
ON attx.intAttributeID = attrib.TargetAttributeID

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1
The problem is, it goes really slowly (several minutes on a table with about 10k entries). I tried creating separate indexes on Machine_ID and Date_Time, and a single combined index, but nothing helps.
Is there any way to rewrite this query to go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on (Machine_ID, Date_Time), and for correct results I'm assuming that combination is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot, since a (Machine_ID, Date_Time) index will not generally be covering, and if you have a lot of columns there, or they hold a lot of data, ...
If the number of rows in dbo.DataTable is large, it is likely that you are experiencing the issue of the CTE self-joining onto itself. There is a blog post explaining the issue in some detail here.
Occasionally in such cases I have resorted to creating a temporary table, inserting the result of the CTE query into it, and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; in the case of a single join the performance difference will be less noticeable).
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have time to dig into the root causes. However, I always suspected that the engine couldn't optimize the CTE the way it can optimize a temp table (which can be indexed if more optimization is needed).
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
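A sketch of that rewrite for the query in this question (same table and columns as above; the temp table and index names are mine):
-- Materialize the numbered rows once, instead of letting the CTE be
-- re-expanded for each reference in the self-join.
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
INTO #Numbered
FROM dbo.DataTable;
CREATE INDEX IX_Numbered ON #Numbered (Machine_ID, Ordering);
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM #Numbered AS [Current]
JOIN #Numbered AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1;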
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculating it each time you pull the data, why not add a column and calculate/populate it whenever a row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
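For what it's worth, on SQL Server 2012 and later (newer than this question) the self-join can be avoided entirely with LAG; a sketch against the same table:
-- LAG reads the previous row within each Machine_ID partition directly,
-- so the table is scanned once and never joined back to itself.
SELECT d.*,
LAG(d.Date_Time) OVER (PARTITION BY d.Machine_ID ORDER BY d.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS d;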
