SQL Server and intermediate materialization?

After reading this interesting article about intermediate materialization - I still have some questions.
I have this query :
SELECT *
FROM ...
WHERE isnumeric(MyCol)=1 and ( CAST( MyCol AS int)>1)
However, the order in which the WHERE predicates are evaluated is not deterministic.
So I might get an exception here (if it tries to cast "k1k1" first).
I assume this will solve the problem:
SELECT MyCol
FROM
(SELECT TOP 100 PERCENT MyCol FROM MyTable WHERE ISNUMERIC(MyCol) = 1 ORDER BY MyCol) bar
WHERE
CAST(MyCol AS int) > 100
Why would adding TOP 100 PERCENT plus ORDER BY make this behave differently from my original query?
I read in the comments:
(the "intermediate" result -- in other words, a result obtained during
the process, that will be used to calculate the final result) will be
physically stored ("materialized") in TempDB and used from there for
the remainder of the query, instead of being queried back from the base
tables.
What difference does it make whether it is stored in tempdb or queried back from the base tables? It is the same data!

The supported way to avoid errors due to the optimizer reorganizing things is to use CASE:
SELECT *
FROM YourTable
WHERE
1 <=
CASE
WHEN aa NOT LIKE '%[^0-9]%'
THEN CONVERT(int, aa)
ELSE 0
END;
Intermediate materialization is not a supported technique, so it should only be employed by very expert users in special circumstances where the risks are understood and accepted.
TOP 100 PERCENT is generally ignored by the optimizer in SQL Server 2005 onward.
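On SQL Server 2012 and later, TRY_CONVERT is another way to sidestep the evaluation-order problem, because it returns NULL instead of raising an error when a conversion fails. A minimal sketch, assuming the same illustrative YourTable and aa column as above and a server version that has TRY_CONVERT (the question does not say which version is in use):

```sql
-- A value that cannot be converted yields NULL, and NULL >= 1 is not true,
-- so such rows are filtered out without an error, whatever order the
-- optimizer chooses for evaluating predicates.
SELECT *
FROM YourTable
WHERE TRY_CONVERT(int, aa) >= 1;
```

The two tests are not identical in what they accept: unlike the LIKE pattern, TRY_CONVERT also converts signed values such as '-5', so check which behaviour the data needs.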

By adding the TOP clause into the inner query, you're forcing SQL Server to run that query first before it runs the outer query - thereby discarding all rows for which ISNUMERIC returns false.
Without the TOP clause, the optimiser can rewrite the query to be the same as your first query.

Related

Microsoft SQL Server: run arbitrary query and save result into temp table

Given an arbitrary select query, how can I save its results into a temporary table?
To simplify things let's assume the select query does not contain an order by clause at the top level; it's not dynamic SQL; it really is a select (not a stored procedure call), and it's a single query (not something that returns multiple result sets). All of the columns have an explicit name. How can I run it and save the results to a temp table? Either by processing the SQL on the client side, or by something clever in T-SQL.
I am not asking about any particular query -- obviously, given some particular SQL I could rewrite it by hand to save into a temp table -- but about a rule that will work in general and can be programmed.
One possible "answer" that does not work in general
For simple queries you can do
select * into #tmp from (undl) x
where undl is the underlying SQL query. But this fails if undl is a more complex query, for example if it uses common table expressions using with.
For similar reasons,
with x as (undl) select * into #tmp from x
does not work in general; with clauses cannot be nested.
My current approach, but not easy to program
The best I've found is to find the top level select of the query and munge it to add into #tmp just before the from keyword. But finding which select to munge is not easy; it requires parsing the whole query in the general case.
Possible solution with user-defined function
One approach may be to create a user-defined function wrapping the query, then select * into #tmp from dbo.my_function() and drop the function afterwards. Is there something better?
More detail on why the simple approach fails when the underlying query uses CTEs. Suppose I try the rule select * into #tmp from (undl) x where undl is the underlying SQL. Now let undl be
with mycte as (select 5 as mycol) select mycol from mycte
Once the rule is applied, the final query is
select * into #tmp from (with mycte as (select 5 as mycol) select mycol from mycte) x
which is not valid SQL, at least not on my version (MSSQL 2016). with clauses cannot be nested.
To be clear, CTEs must be defined at the top level before the select. They cannot be nested and cannot appear in subqueries. I fully understand that and it's why I am asking this question. An attempt to wrap the SQL that ends up trying to nest the CTEs will not work. I am looking for an approach that will work.
"Put an into right before the select". This will certainly work but requires parsing the SQL in the general case. It's not always obvious (to a computer program) which select needs to change. I did try the rule of adding it to the last select in the query, but this also fails. For example if the underlying query is
with mycte as (select 5 as mycol) select mycol from mycte except select 6
then the into #x needs to be added to the second select, not to the one that appears after except. Getting this right in the general case involves parsing the SQL into a syntax tree.
In the end creating a user-defined function appears to be the only general answer. If undl is the underlying select query, then you can say
create function dbo.myfunc() returns table as return (undl)
go
select * into #tmp from dbo.myfunc()
go
drop function dbo.myfunc
go
The pseudo-SQL go indicates starting a new batch. The create function must be executed in one batch before the select, otherwise you get a syntax error. (Just separating them with ; is not enough.)
This approach works even when undl contains subqueries or common table expressions using with. However, it does not work when the query uses temporary tables.
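As a concrete instance of the pattern above, here is a sketch that wraps the CTE-plus-except query from the question. The function name dbo.tmp_wrap is hypothetical, and GO is the client-side batch separator understood by tools such as SSMS and sqlcmd, not a T-SQL statement:

```sql
-- Batch 1: wrap the underlying query, CTE included, in an inline
-- table-valued function. CREATE FUNCTION must be alone in its batch.
CREATE FUNCTION dbo.tmp_wrap() RETURNS TABLE AS RETURN
(
    WITH mycte AS (SELECT 5 AS mycol)
    SELECT mycol FROM mycte
    EXCEPT
    SELECT 6
);
GO

-- Batch 2: materialize the result. No parsing of the underlying query is
-- needed, because INTO is attached to the outer select.
SELECT * INTO #tmp FROM dbo.tmp_wrap();
GO

-- Batch 3: the function is no longer needed once #tmp exists.
DROP FUNCTION dbo.tmp_wrap;
GO

-- #tmp lives for the rest of the session; here it holds the single row 5.
SELECT mycol FROM #tmp;
```

A client program generating these batches would need a function name that cannot collide with concurrent sessions, for example one containing a fresh GUID.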

erratic "delayed" CTE evaluation?

I observe a behaviour with CTEs which I did not expect (and seems inconsistent).
Not quite sure that it is correct...
Basically, through a CTE, I filter rows to avoid a particular problem, then use the result of that CTE to perform calculations that would break on the problematic rows which I thought I eliminated in my CTE...
Take a simple table with a varchar column that often has a number in it, but not always
CREATE TABLE MY_TABLE(ROW_ID INTEGER NOT NULL
, GOOD_ROW BOOLEAN NOT NULL
, SOME_VALUE VARCHAR NOT NULL);
INSERT INTO MY_TABLE(ROW_ID, GOOD_ROW, SOME_VALUE)
VALUES(1, TRUE, '1'), (2, TRUE, '2'), (3, FALSE, 'ABC');
I also create a small table with just numbers to join on
CREATE TABLE NUMBERS(NUMBER_ID INTEGER NOT NULL);
INSERT INTO NUMBERS(NUMBER_ID) VALUES(1), (2), (3);
Joining these two tables on SOME_VALUE results in an error because 'ABC' is not numeric and it appears that the JOIN is evaluated BEFORE the WHERE clause (BAD implications on performance here...)
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3; --> ERROR
So, I try to filter my first table through a CTE which only return rows for which SOME_VALUE is numeric
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES;
Now, I would expect to be able to use the result of this CTE with SOME_VALUE being numeric.
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
Miracle!!!
It worked!
I get my 2 expected records.
So far so good...
However, if I had defined my CTE slightly differently (WHERE clause which filters the same records)
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES;
This CTE returns exactly the same thing as before
But if I try to join, it Fails!
WITH ONLY_GOOD_ONES
AS (
SELECT *
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
I get the following error...
SQL Error [100038] [22018]: Numeric value 'ABC' is not recognized
Is there a particular explanation to this second version of the CTE behaving differently???
The actual answer is that Snowflake does not follow the SQL standard: it does not execute the SQL in the order given.
It applies transforms to the data before filtering whenever its optimizer decides it wants to.
So for your table MY_TABLE when you do
SELECT some_value::NUMBER FROM my_table WHERE row_id IN (1,2);
you will, in some cases, have the cast to NUMBER happen on all rows, and it will explode on the 'ABC'. This violates the SQL rule that WHERE is evaluated before SELECT transforms are done, but Snowflake has known about this for years, and it is intentional, as it makes things run faster.
The solution is to understand that you have mixed data, assume the code can and will be run out of order, and therefore use the protective versions of the functions, like TRY_TO_NUMBER.
The kicker is that you can write a few nested SELECTs to avoid the problem, then put something like a window function around the code, and the optimizer jumps back into this behaviour and your SQL explodes again. So the solution is to recognise that you have mixed data and handle it. Oh, and complain that it's a bug.
This is because you're getting a different execution plan with the different queries.
Here's how the query is executed with the working query:
... and here is how it's executed with the query generating a failure. The error comes from the fact that the join filter is applied directly on the table scan before the ROW_ID < 3 filter is applied, compared to the working query.
You can see these plans under history, clicking the query id and then the 'profile' tab.
It looks like the join filter is applied so early, maybe because of a wrong estimation. When I run the queries on my test database, they completed without any error.
To overcome the issue, you can always use the "Error-handling Conversion Functions":
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TRY_TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3;
More information:
https://docs.snowflake.com/en/sql-reference/functions-conversion.html#label-try-conversion-functions

sql constant function causes inferior query plan to be used

I'm on SQL Server 2016 and am seeing the following:
I have a simple query similar to:
select distinct col1
from tbl
where
col2 > 12345
If I move the constant value into a function, the query plan changes (for the worse, by A LOT):
select distinct col1
from tbl
where
col2 > dbo.fn12345()
where the function is
create function dbo.fn12345()
returns int
as begin
return 12345
end
Here are screenshots of the plans (using my actual schema, so the identifiers are different from the illustrative example).
without function:
with function:
With the 2nd plan my execution time goes from 22s to 96s.
Is there any way to fix this while still using functions?
Please, no questions asking why I can't just inline the constant. The same issue occurs for more complex functions that include sargable logic; inlining what is effectively a complex constant calculation changes the query plan.
I am also aware that my index is not optimal. This is by design. The table is very large and this particular query doesn't warrant the storage for a dedicated index.
You are always going to run into problems with functions in where clauses.
Even something as straightforward as ISNULL() can change the plan.
Is there any way you can persist the computed result in a table (even a temp table)? Then you can cross join to this.
NB - Create stats on your table as this will help the optimizer.
SELECT 12345 as val into #t
select distinct col1
from tbl
CROSS JOIN #t
where
col2 > val
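Another variation on computing the value once, offered as a sketch and not a guaranteed fix: evaluate the function into a local variable and add OPTION (RECOMPILE), which lets the optimizer see the variable's runtime value when it compiles the statement. Whether this restores the faster plan depends on the data and indexes, and the statement is recompiled on every execution. The names tbl, col1, col2 and dbo.fn12345 are the question's illustrative ones.

```sql
-- Evaluate the scalar function exactly once, outside the query.
DECLARE @val int = dbo.fn12345();

-- With OPTION (RECOMPILE) the plan is built using the current value of @val,
-- much as if the constant had been written inline.
SELECT DISTINCT col1
FROM tbl
WHERE col2 > @val
OPTION (RECOMPILE);
```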

SQL Server: order of returned rows when using IN clause

Running the following query returns 4 rows. As I can see in SSMS the order of returned rows is the same as I specified in the IN clause.
SELECT * FROM Table WHERE ID IN (4,3,2,1)
Can I say that the order of the returned rows is ALWAYS the same as the order in which the values appear in the IN clause?
If so, is it true that the following two queries return the rows in the same order? (In my tests the orders are the same, but I don't know if I can trust this behavior.)
SELECT TOP 10 * FROM Table ORDER BY LastModification DESC
SELECT * FROM Table WHERE ID IN (SELECT TOP 10 ID FROM Table ORDER BY LastModification DESC)
I ask this question because I have a quite complex select query. Using this trick over it brings me ca. 30% performance gain, in my case.
You cannot guarantee that the records come back in any particular order unless you use an ORDER BY clause. Some tricks may work some of the time, but they won't give you a guarantee of the order.
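To make the ordering explicit, the outer query needs its own ORDER BY. A minimal sketch using the question's placeholder names (Table is a reserved word, so it is bracketed here):

```sql
-- Return the rows in the same order as the IN list by ranking each ID.
SELECT *
FROM [Table]
WHERE ID IN (4, 3, 2, 1)
ORDER BY CASE ID WHEN 4 THEN 1 WHEN 3 THEN 2 WHEN 2 THEN 3 WHEN 1 THEN 4 END;

-- The ORDER BY inside the subquery only decides WHICH ten IDs are picked;
-- the outer ORDER BY is what fixes the order of the returned rows.
SELECT t.*
FROM [Table] AS t
WHERE t.ID IN (SELECT TOP 10 ID FROM [Table] ORDER BY LastModification DESC)
ORDER BY t.LastModification DESC;
```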

How to force SQL Server to process CONTAINS clauses before WHERE clauses?

I have a SQL query that uses both standard WHERE clauses and full text index CONTAINS clauses. The query is built dynamically from code and includes a variable number of WHERE and CONTAINS clauses.
In order for the query to be fast, it is very important that the full text index be searched before the rest of the criteria are applied.
However, SQL Server chooses to process the WHERE clauses before the CONTAINS clauses and that causes tables scans and the query is very slow.
I'm able to rewrite this using two queries and a temporary table. When I do so, the query executes 10 times faster. But I don't want to do that in the code that creates the query because it is too complex.
Is there a way to force SQL Server to process the CONTAINS before anything else? I can't force a plan (USE PLAN) because the query is built dynamically and varies a lot.
Note: I have the same problem on SQL Server 2005 and SQL Server 2008.
You can signal your intent to the optimiser like this
SELECT
*
FROM
(
SELECT *
FROM
WHERE
CONTAINS
) T1
WHERE
(normal conditions)
However, SQL is declarative: you say what you want, not how to do it. So the optimiser may decide to ignore the nesting above.
You can force the derived table with CONTAINS to be materialised before the classic WHERE clause is applied. I won't guarantee performance.
SELECT
*
FROM
(
SELECT TOP 2000000000
*
FROM
....
WHERE
CONTAINS
ORDER BY
SomeID
) T1
WHERE
(normal conditions)
Try doing it with 2 queries without temp tables:
SELECT *
FROM table
WHERE id IN (
SELECT id
FROM table
WHERE contains_criterias
)
AND further_where_classes
As I noted above, this is NOT as clean a way to "materialize" the derived table as the TOP clause that @gbn proposed, but a loop join hint forces an order of evaluation, and has worked for me in the past (admittedly usually with two different tables involved). There are a couple of problems though:
The query is ugly
you still don't get any guarantees that the other WHERE parameters don't get evaluated until after the join (I'll be interested to see what you get)
Here it is though, given that you asked:
SELECT OriginalTable.XXX
FROM (
SELECT XXX
FROM OriginalTable
WHERE
CONTAINS XXX
) AS ContainsCheck
INNER LOOP JOIN OriginalTable
ON ContainsCheck.PrimaryKeyColumns = OriginalTable.PrimaryKeyColumns
AND OriginalTable.OtherWhereConditions = OtherValues
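For comparison, the two-query rewrite with a temporary table that the question describes as about ten times faster could look like the sketch below. The table dbo.Docs, its columns DocID, Body and Status, and the search term are all hypothetical; the sketch assumes a full-text index on Body and that DocID is the table's key.

```sql
-- Step 1: run only the full-text predicate and keep the matching keys.
SELECT DocID
INTO #hits
FROM dbo.Docs
WHERE CONTAINS(Body, N'"search term"');

-- Step 2: apply the ordinary WHERE conditions to the matching rows only.
SELECT d.*
FROM dbo.Docs AS d
INNER JOIN #hits AS h ON h.DocID = d.DocID
WHERE d.Status = 1;

DROP TABLE #hits;
```

Because the first statement finishes before the second starts, the full-text search is evaluated first whatever the optimizer would otherwise prefer, at the cost of writing the keys to tempdb.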
