erratic "delayed" CTE evaluation? - snowflake-cloud-data-platform

I observe a behaviour with CTEs which I did not expect (and seems inconsistent).
Not quite sure that it is correct...
Basically, through a CTE, I filter rows to avoid a particular problem, then use the result of that CTE to perform calculations that would break on the problematic rows which I thought I eliminated in my CTE...
Take a simple table with a varchar column that often has a number in it, but not always
CREATE TABLE MY_TABLE(ROW_ID INTEGER NOT NULL
, GOOD_ROW BOOLEAN NOT NULL
, SOME_VALUE VARCHAR NOT NULL);
INSERT INTO MY_TABLE(ROW_ID, GOOD_ROW, SOME_VALUE)
VALUES(1, TRUE, '1'), (2, TRUE, '2'), (3, FALSE, 'ABC');
I also create a small table with just numbers to join on
CREATE TABLE NUMBERS(NUMBER_ID INTEGER NOT NULL);
INSERT INTO NUMBERS(NUMBER_ID) VALUES(1), (2), (3);
Joining these two tables on SOME_VALUE results in an error because 'ABC' is not numeric and it appears that the JOIN is evaluated BEFORE the WHERE clause (BAD implications on performance here...)
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3; --> ERROR
So, I try to filter my first table through a CTE which only return rows for which SOME_VALUE is numeric
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES;
Now, I would expect to be able to use the result of this CTE with SOME_VALUE being numeric.
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
Miracle!!!
It worked!
I get my 2 expected records.
So far so good...
However, if I had defined my CTE slightly differently (WHERE clause which filters the same records)
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES;
This CTE returns exactly the same thing as before
But if I try to join, it Fails!
WITH ONLY_GOOD_ONES
AS (
SELECT *
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
I get the following error...
SQL Error [100038] [22018]: Numeric value 'ABC' is not recognized
Is there a particular explanation to this second version of the CTE behaving differently???

The actual answer is because snowflake does not follow the SQL standard, and execute SQL in the order given.
They apply transforms to data prior to filtering when there optimizer decides it wants to.
So for your table MY_TABLE when you do
SELECT some_value::NUMBER FROM my_table WHERE row_id IN (1,2);
You will under some cases have the as_number cast happen on all row, and explode on the 'ABC'. Which is violating SQL rules, that WHERE are evaluated before SELECT transforms are done, but Snowflake have known this for years, and it's intentional, as it makes things run faster.
The solution is to understand you have mixed data and therefore assume the code can and will be ran out of order, and thus use the protective versions of the functions like TRY_TO_NUMBER
The kicker is you can write a few nested SELECTs to avoid the problem and then put something like a window funcation around the code and the optimizer jump back into this behavour and you SQL explodes again. Thus the solution is to understand if you have mixed data, and handle it. Oh and complain it's a bug.

This is because you're getting a different execution plan with the different queries.
Here's how the query is executed with the working query:
... and here is how it's executed with the query generating a failure. The error comes from the fact that the join filter is applied directly on the table scan before the ROW_ID < 3 filter is applied, compared to the working query.
You can see these plans under history, clicking the query id and then the 'profile' tab.

It looks like the join filter is applied so early, maybe because of a wrong estimation. When I run the queries on my test database, they completed without any error.
To overcome the issue, you can always "Error-handling Conversion Functions":
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TRY_TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3;
More information:
https://docs.snowflake.com/en/sql-reference/functions-conversion.html#label-try-conversion-functions

Related

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE a = '1/1/2020'
However, if I join with another small table in the final SELECT statement it is extremely slow (> 30 sec)
DECLARE #anotherTable TABLE (A DATE, B INT)
INSERT INTO #anotherTable (AsOfDate, FundId)
VALUES ('1/1/2020', 1)
;WITH JointIncomingData AS
(
SELECT A, B, C, D FROM dbo.table1
UNION ALL
SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN #anotherTable T ON T.A = D.A AND T.B = D.B
In the real application, I have a complex UPDATE as the final statement, so I try to avoid copy-paste and introduces UNION to consolidate code.
But now experience an unexpected issue with slowness.
I tried using UNION ALL instead of UNION - with the same result.
Looks like SQL Server pushed simple conditions to each of UNION statements, but when I join it with another table, it doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here is estimated plans
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that estimated plans are not exactly for the above query, but more real one, having exactly the same problem.
When I'm doing complex updates I normally declare a temp table and insert the rows into it that I intend to update. There's two benefits to this approach, one being that by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple (just update the rows whose primary key is in your temp table). The other big benefit of it is you can do some sanity checking before actually running your update, and "throw an error" by returning a different value.
I think it's generally a good practice to break down queries into simple steps like this, because it makes them much easier to troubleshoot in the future.
Based on the "similar" execution plan you shared. It would also be better to have the actual plan, to know if your estimates and memory grants are ok.
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the following columns nav, equity
Why? Every row the index returns, create a lookup in the clusterd index to retrieve the column values of nav, equity.
Only if this is reasonable for the application, if other queries may benefit as well
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM
ETL.tblIncomingData
UNION ALL
SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
FROM ETL.vIncomingDataDPerf
) x
Why? CTE's are not materialized. and this answer
Bonus: parameter sniffing
If you pass in parameters you might be suffering from parameters sniffing.

Select where id not in (...) has any limit? - NOT IN VS NOT EXISTS

I want to select some rows from a catalog which are not related to two table (Table1 and Table2).
The rows in the catalog are related to the tables using the field idA.
I am using the following query:
;with CTE(id) AS(
select distinct idA from Table1
union
select distinct idA from Table2
)
select * from Catalog where IdA NOT IN (select id from CTE)
The very strange part is that this query does not return any result.
If I use the following query after de CTE it will return some rows
select Id from Catalog where not exists (select 1 from CTE where Catalog.IdA = CTE.Id);
Does anyone know what is the reason of this? AFAK both queries are equivalents.
Of course the first query is slower than the second one but that is not very important. The important part is, why those queries do not return the same results?
Maybe it's important to mention that Table1 has more than 3.9 million rows
but just 139,283 different values for idA.
Table2 has only some thousands rows
Any help or comment will be appreciated
If one or more NULL values are returned by a NOT IN subquery, the result of the predicate is unknown rather than true or false and no rows returned due to the unsatisfied condition. It is best to avoid NOT IN with nullable expressions to avoid this non-intuitive behavior.
Although slighly more verbose, a correlated NOT EXISTS subquery will always return true or false and avoid the NULL gotcha.

Single condition slows down SQL query drastically

I have a SQL query looking something like this:
WITH RES_CTE AS
(SELECT
COLUMN1,
COLUMN2,
[MORE COLUMNS...]
ROW_NUMBER() OVER (ORDER BY R.RANKING DESC) AS RowNum
FROM TABLE1 As R, TABLE2 As A, TABLE3 As U, TABLE4 As S, TABLE5 As T
WHERE R.RID = A.LID
AND S.QRYID = R.QRYID
AND A.AID = U.AID
AND CONDITION1 = 'VALUE'
AND CONDITION2 = 'VALUE'
AND [MORE CONDITIONS...]
),
Results_Cnt AS
(SELECT COUNT(*) CNT FROM Results_CTE)
SELECT * FROM Results_CTE, Results_Cnt WHERE RowNum >= 1 AND RowNum <= 25
Now, this query typically runs under 1 sec and returns the 25 records out of 5000 based on CONDITION1.
Recently though, I added a new column to a TABLE1 and then use its values as a CONDITION2 in the query above. The column is populated going forward but all the values in the past are NULL.
I read something above joining table that have NULL being a reason for slow execution. The table has about 1,300,000 records. 90% of them are NULL in the problematic column. But that column is not being joined on. (The one that is being joined on has an INDEX)
However, I wanted to try that anyway by creating a new column and simply copying the data like so:
ALTER TABLE TABLE1 ADD COL_NEW
UPDATE TABLE1 SET COL_NEW = COL_OLD
My next step was to replace the NULLs with an actual value but first, just for kicks, I changed the query to use as a condition the new field COL_NEW, and the problem went away.
Although I'm happy the problem is gone, I can't explain it to myself. Why was the execution slow in the first place if it had nothing to do with the NULLs?
UPDATE: It appears the problem may have resulted from a cached query plan. So the question essentially becomes, how to force a query plan refresh?
UPDATE: Although doing ALTER TABLE may have refreshed the execution plan, the problem returned. How can I find out what is happening?
It sounds like your query plan got cached while the stats for the new column showed it completely full of nulls, forcing a table scan. Following the ALTER TABLE the query plan was refreshed, replcing the table scan with an index lookujp again, and performance returned to normal.
The only way to know for sure if that is what happened would be to examine the query plans for both queries, but those are long gone now.

Transact SQL parallel query execution

Suppose I have
INSERT INTO #tmp1 (ID) SELECT ID FROM Table1 WHERE Name = 'A'
INSERT INTO #tmp2 (ID) SELECT ID FROM Table2 WHERE Name = 'B'
SELECT ID FROM #tmp1 UNION ALL SELECT ID FROM #tmp3
I would like to run queries 1 & 2 in parallel, and then combine results after they are finished.
Is there a way to do this in pure T-SQL, or a way to check if it will do this automatically?
A background for those who wants it: I investigate a complex search where there're multiple conditions which are later combined (term OR (term2 AND term3) OR term4 AND item5=term5) and thus I investigate if it would be useful to execute those - largely unrelated - conditions in parallel, later combining resulting tables (and calculating ranks, weights, and so on).
E.g. should be several resultsets:
SELECT COUNT(*) #tmp1 union #tmp3
SELECT ID from (#tmp1 union #tmp2) WHERE ...
SELECT * from TABLE3 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
SELECT * from TABLE4 where ID IN (SELECT ID FROM #tmp1 union #tmp2)
You don't. SQL doesn't work like that: it isn't procedural. It leads to race conditions and data issues because of other connections
Table variables are also scoped to the batch and connection so you can't share results over 2 connections in case you're wondering.
In any case, all you need is this, unless you gave us an bad example:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION
SELECT ID FROM Table2 WHERE Name = 'B'
I suspect you're thinking of "run in parallel" because of this procedural thinking. What is your actual desired problem and goal?
Note: table variables do not allow parallel operations: Can queries that read table variables generate parallel exection plans in SQL Server 2008?
You don't decide what to parallelise - SQL Server's optimizer does. And the largest unit of work that the optimizer will work with is a single statement - so, you find a way to express your query as a single statement, and then rely on SQL Server to do its job, which it will usually do quite well.
If, having constructed your query, the performance isn't acceptable, then you can look at applying hints or forcing certain plans to be used. A lot of people break their queries into multiple statements, either believing that they can do a better job than SQL Server, or because it's how they "naturally" think of the task at hand. Both are "wrong" (for certain values of wrong), but if there's a natural breakdown, you may be able to replicate it using Common Table Expressions - these would allow you to name each sub-part of the problem, and then combine them together, all as part of a single statement.
E.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabB AS (
SELECT ID FROM Table2 WHERE Name = 'B'
)
SELECT ID FROM TabA UNION ALL SELECT ID FROM TabB
And this will allow the server to decide how best to resolve this query (e.g. deciding whether to store intermediate results in "temp" tables)
Seeing in one of your other comments you discussing about having to "work with" the intermediate results - this can still be done with CTEs (if it's not just a case of you failing to be able to express the "final" result as a single query), e.g.:
;WITH TabA AS (
SELECT ID FROM Table1 WHERE Name = 'A'
), TabAWithCalcs AS (
SELECT ID,(ID*5+6) as ModID from TabA
)
SELECT * FROM TabAWithCalcs
Why not just:
SELECT ID FROM Table1 WHERE Name = 'A'
UNION ALL
SELECT ID FROM Table2 WHERE Name = 'B'
then if SQL Server wants to run the two selects in parallel, it will do at its own violition.
Otherwise we need more context for what you're trying to achieve if this isn't practical.

Output to Table Variable, Not CTE

Why is this legal:
DECLARE #Party TABLE
(
PartyID nvarchar(10)
)
INSERT INTO #Party
SELECT Name FROM
(INSERT INTO SomeOtherTable
OUTPUT inserted.Name
VALUES ('hello')) [H]
SELECT * FROM #Party
But the next block gives me an error:
WITH Hey (Name)
AS (
SELECT Name FROM
(INSERT INTO SomeOtherTable
OUTPUT inserted.Name
VALUES ('hello')) [H]
)
SELECT * FROM Hey
The second block gives me the error "A nested INSERT, UPDATE, DELETE, or MERGE statement is not allowed in a SELECT statement that is not the immediate source of rows for an INSERT statement.
It seems to be saying thst nested INSERT statements are allowed, but in my CTE case I did not nest inside another INSERT. I'm surprised at this restriction. Any work-arounds in my CTE case?
As for why this is illegal, allowing these SELECT operations with side effects would cause all sorts of problems I imagine.
CTEs do not get materialised in advance into their own temporary table so what should the following return?
;WITH Hey (Name)
AS
(
...
)
SELECT name
FROM Hey
JOIN some_other_table ON Name = name
If the Query Optimiser decided to use a nested loops plan and Hey as the driving table then presumably one insert would occur. However if it used some_other_table as the driving table then the CTE would get evaluated as many times as there were rows in that other table and so multiple inserts would occur. Except if the Query Optimiser decided to add a spool to the plan and then it would only get evaluated once.
Presumably avoiding this sort of mess is the motivation for this restriction (as with the restrictions on side effects in functions)

Resources