WITH is not working as I expected in SQL Server 2012 - sql-server

I am getting different results from a WITH statement. Here is my first query:
with q as (select top (100000) * from table1) select * from q
Let's say that table1 has an ID field. If I execute that query, everything seems normal and it works as I expected. But if I change the statement like this:
with q as (select top (100000) * from table1) select [ID] from q
or
with q as (select top (100000) * from table1) select q.[ID] from q
it returns rows that do not exist in the result of the first query (note that I only select ID). I understand that a WITH statement is a temporary result set, and I expect both queries to return the same rows no matter how many fields I select, so why is this happening? This could be a problem if I want to perform an update, or even worse a delete, because I will not be completely sure that I have affected the rows that I wanted.

If you select TOP x without an ORDER BY, which rows you get back is arbitrary, meaning you can get a different result set each time you execute it. Since you are changing the query slightly, I'm not surprised the result set is different: selecting only [ID] can, for example, let the optimizer read the rows from a different index. Add an ORDER BY if you SELECT TOP x.
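As a minimal sketch of that advice, the query from the question could be written with an ORDER BY inside the CTE so that TOP keeps the same rows on every run. This assumes [ID] is unique in table1; if it is not, rows that tie on [ID] can still be picked arbitrarily.

```sql
-- Sketch only: assumes [ID] uniquely identifies a row in table1.
WITH q AS (
    SELECT TOP (100000) *
    FROM table1
    ORDER BY [ID]      -- makes the choice of rows deterministic
)
SELECT [ID]
FROM q;
```

The ORDER BY inside the CTE only decides which rows TOP keeps. If the order of the final output matters, the outer SELECT needs its own ORDER BY.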

Related

Getting results from results

I apologize if I don't make much sense but I've tangled my brain up trying to work this out.
I'm trying to obtain a result set using the results from one query but then also hoping to include the previous results within the new query and then somehow group them.
What I have are parent work order numbers and their child work order numbers.
Sadly the system I am using doesn't have the functionality set up yet to simply produce a report that shows all the specific type of work and their linked work.
So I have an initial basic query 1 to find anything that has a "JPNUM like AK0147" and "STATUS NOT IN ('COMPLETE', 'CANCELLED', 'REVIEWED', 'CLOSED')"
The result of the above query 1 will return a result set that includes the column 'WONUM'.
I need to then do a separate search using the column 'PARENT' whereby I return any results that have a number in this column matching any of the WONUMs that were returned in query 1.
I also want to include the results of query 1, probably in query 3, so I can group them together.
How do I write a query that includes my results from query 1 in query 2, and how do I then group them so I have the parent WONUM at the top and its child work orders underneath, like the final results table I have shown in the attached image?
You could run a select from another select and so on.
I will write you an example:
SELECT PARENT.WONUM, PARENT.JPNUM
FROM (SELECT WONUM, JPNUM
FROM yourTable
WHERE JPNUM LIKE 'AK0147'
AND STATUS NOT IN ('COMPLETE', 'CANCELLED', 'REVIEWED', 'CLOSED')) PARENT
WHERE ...
This way the result of the inner SELECT acts like a temporary table.
There's more than one way to do it. If you're using SQL Server, I recommend a CTE:
WITH Query1 AS
(
SELECT WONUM, JPNUM
FROM MyTable1
WHERE ...
),
Query2 AS
(
SELECT WONUM, PARENT
FROM Query1 -- You can use Query1, if you want
JOIN MyTable2 ON Query1.JPNUM = ...
WHERE ...
)
-- Final Result:
SELECT WONUM, PARENT
FROM Query2
JOIN Query1 ON ...
JOIN Table3 ON ...
WHERE ...
In this way, each query can build on the previous query, or on the one before that (if needed).
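Applied to the work order question, the CTE approach might look like the sketch below. Several things here are assumptions and not from the original posts: the table name WORKORDER is hypothetical, the trailing % wildcard on the JPNUM pattern is a guess at what "like AK0147" meant, and GROUP_WONUM and IS_CHILD are illustrative column names added only to sort each parent above its children.

```sql
-- Sketch only: WORKORDER is a hypothetical table name.
WITH Parents AS (
    -- "Query 1": the open parent work orders
    SELECT WONUM
    FROM WORKORDER
    WHERE JPNUM LIKE 'AK0147%'   -- wildcard is an assumption
      AND STATUS NOT IN ('COMPLETE', 'CANCELLED', 'REVIEWED', 'CLOSED')
)
SELECT p.WONUM AS GROUP_WONUM, p.WONUM AS WONUM, 0 AS IS_CHILD
FROM Parents p
UNION ALL
-- "Query 2": children whose PARENT matches a WONUM from query 1
SELECT c.PARENT, c.WONUM, 1
FROM WORKORDER c
JOIN Parents p ON c.PARENT = p.WONUM
-- Parent row first (IS_CHILD = 0), then its children
ORDER BY GROUP_WONUM, IS_CHILD, WONUM;
```

Sorting on GROUP_WONUM keeps each family of work orders together, and IS_CHILD puts the parent row at the top of its group.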

erratic "delayed" CTE evaluation?

I observe a behaviour with CTEs which I did not expect (and seems inconsistent).
Not quite sure that it is correct...
Basically, through a CTE, I filter rows to avoid a particular problem, then use the result of that CTE to perform calculations that would break on the problematic rows which I thought I eliminated in my CTE...
Take a simple table with a varchar column that often has a number in it, but not always
CREATE TABLE MY_TABLE(ROW_ID INTEGER NOT NULL
, GOOD_ROW BOOLEAN NOT NULL
, SOME_VALUE VARCHAR NOT NULL);
INSERT INTO MY_TABLE(ROW_ID, GOOD_ROW, SOME_VALUE)
VALUES(1, TRUE, '1'), (2, TRUE, '2'), (3, FALSE, 'ABC');
I also create a small table with just numbers to join on
CREATE TABLE NUMBERS(NUMBER_ID INTEGER NOT NULL);
INSERT INTO NUMBERS(NUMBER_ID) VALUES(1), (2), (3);
Joining these two tables on SOME_VALUE results in an error because 'ABC' is not numeric and it appears that the JOIN is evaluated BEFORE the WHERE clause (BAD implications on performance here...)
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3; --> ERROR
So, I try to filter my first table through a CTE which only returns rows for which SOME_VALUE is numeric
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES;
Now, I would expect to be able to use the result of this CTE with SOME_VALUE being numeric.
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE GOOD_ROW = TRUE
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
Miracle!!!
It worked!
I get my 2 expected records.
So far so good...
However, if I had defined my CTE slightly differently (WHERE clause which filters the same records)
WITH ONLY_GOOD_ONES
AS (
SELECT SOME_VALUE
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES;
This CTE returns exactly the same thing as before.
But if I try to join, it fails!
WITH ONLY_GOOD_ONES
AS (
SELECT *
FROM MY_TABLE
WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TO_NUMBER(SOME_VALUE);
I get the following error...
SQL Error [100038] [22018]: Numeric value 'ABC' is not recognized
Is there a particular explanation for this second version of the CTE behaving differently?
The actual answer is that Snowflake does not follow the SQL standard here: it does not necessarily execute SQL in the order given.
They apply transforms to data prior to filtering when their optimizer decides it wants to.
So for your table MY_TABLE when you do
SELECT some_value::NUMBER FROM my_table WHERE row_id IN (1,2);
you will in some cases have the cast to NUMBER happen on all rows, and explode on the 'ABC'. This violates the SQL rule that WHERE is evaluated before SELECT transforms are done, but Snowflake has known this for years, and it's intentional, as it makes things run faster.
The solution is to understand that you have mixed data, assume the code can and will be run out of order, and therefore use the protective versions of the functions, like TRY_TO_NUMBER.
The kicker is that you can write a few nested SELECTs to avoid the problem, then put something like a window function around the code, and the optimizer jumps back into this behaviour and your SQL explodes again. Thus the solution is to understand that you have mixed data, and handle it. Oh, and complain that it's a bug.
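As a sketch of that advice applied to the failing CTE from the question, the join can use TRY_TO_NUMBER, which yields NULL for 'ABC' instead of raising an error, so the result no longer depends on when the optimizer applies the ROW_ID filter. The table and column names are the ones from the question.

```sql
WITH ONLY_GOOD_ONES AS (
    SELECT *
    FROM MY_TABLE
    WHERE ROW_ID < 3
)
SELECT *
FROM ONLY_GOOD_ONES
-- TRY_TO_NUMBER('ABC') is NULL, and NULL never equals a NUMBER_ID,
-- so a non-numeric row is simply dropped by the inner join.
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TRY_TO_NUMBER(SOME_VALUE);
```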
This is because you're getting a different execution plan with the different queries.
Here's how the query is executed with the working query:
... and here is how it's executed with the query generating a failure. The error comes from the fact that the join filter is applied directly on the table scan before the ROW_ID < 3 filter is applied, compared to the working query.
You can see these plans under history, clicking the query id and then the 'profile' tab.
It looks like the join filter is applied so early, maybe because of a wrong estimation. When I run the queries on my test database, they completed without any error.
To overcome the issue, you can always use the "Error-handling Conversion Functions":
SELECT *
FROM MY_TABLE
INNER JOIN NUMBERS ON NUMBERS.NUMBER_ID = TRY_TO_NUMBER(SOME_VALUE)
WHERE ROW_ID < 3;
More information:
https://docs.snowflake.com/en/sql-reference/functions-conversion.html#label-try-conversion-functions

select top 1 * vs select top 1 1

I know there's a lot of these questions, but I can't find one that relates to my question.
Looking at this question, Is Changing IF EXIST(SELECT 1 FROM ) to IF EXIST(SELECT TOP 1 FROM ) has any side effects?
Specifically referring to this section in the answer:
select * from sys.objects
select top 1 * from sys.objects
select 1 where exists(select * from sys.objects)
select 1 where exists(select top 1 * from sys.objects)
I'm running some of my own tests to properly understand it. As indicated in the answer:
select 1 where exists(select top 1 * from sys.objects)
select 1 where exists(select top 1 1 from sys.objects)
both produce the same execution plan, which is also the same plan as
select 1 where exists(select * from sys.objects)
select 1 where exists(select 1 from sys.objects)
From my research into questions like this one, “SELECT TOP 1 1” VS “IF EXISTS(SELECT 1”, I'm deducing that this is the agreed best practice:
select 1 where exists(select * from sys.objects)
My first question is why is this preferred over this:
select 1 where exists(select 1 from sys.objects)
In trying to understand it, I broke them down to their more basic expressions (I'm using 'top 1' to mimic an execution plan resembling exists):
select top 1 * from sys.objects
select top 1 1 from sys.objects
I now see that the first is 80% of the execution time (relative to the batch of 2) whilst the second is only 20%. Would it then not be better practice to use
select 1 where exists(select 1 from sys.objects)
as it can be applied to both scenarios and thereby reduce possible human error?
SQL Server detects the EXISTS predicate relatively early in the query compilation / optimisation process, and eliminates actual data retrieval for such clauses, replacing them with existence checks. So your assumption:
I now see that the first is 80% of the execution time (relative to the batch of 2) whilst the second is only 20%.
is wrong, because in the preceding comparison you have actually retrieved some data, which doesn't happen if the query is put into the (not) exists predicate.
Most of the time, it makes no difference how you test for the existence of rows, except for a single yet important catch. Suppose you say:
if exists (select * from dbo.SomeTable)
...
somewhere in a code module (view, stored procedure, function etc.). Then, later, when someone else decides to add the WITH SCHEMABINDING clause to this code module, SQL Server will not allow it: instead of possibly binding to the current list of columns, it will throw an error:
Msg 1054, Level 15, State 7, Procedure BoundView, Line 6
Syntax '*' is not allowed in schema-bound objects.
So, in short:
if exists (select 0 from ...)
is the safest, fastest and one-size-fits-all way to do existence checks.
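A small sketch of the schema-binding case, using the dbo.SomeTable name from above. The column definition and the function name dbo.HasRows are hypothetical and exist only for illustration. Because the EXISTS subquery selects a constant rather than *, the WITH SCHEMABINDING clause is accepted.

```sql
-- Sketch only: the table layout and dbo.HasRows are made up.
CREATE TABLE dbo.SomeTable (Id INT NOT NULL PRIMARY KEY);
GO
CREATE FUNCTION dbo.HasRows()
RETURNS BIT
WITH SCHEMABINDING
AS
BEGIN
    -- SELECT 0 avoids the "Syntax '*' is not allowed" error (Msg 1054)
    RETURN CASE WHEN EXISTS (SELECT 0 FROM dbo.SomeTable)
                THEN 1 ELSE 0 END;
END;
GO
SELECT dbo.HasRows() AS HasRows;
```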
The difference between these two:
select top 1 * from sys.objects
select top 1 1 from sys.objects
is that in the first query SQL Server must fetch all the columns from the table (from an arbitrary row), but in the second it is enough to fetch "1" from any index.
Things change when these queries are inside an exists clause, because in that case SQL Server knows that it doesn't actually have to fetch the data, since it will not be assigned to anything, so it can handle select * the same way it would handle select 1.
Since exists checks just one row, it has an internal top 1 built into it, so adding it manually doesn't change anything.
Whether to use select * or select 1 in an exists clause is just a matter of opinion, and instead of 1 you could of course have 2 or 'X' or whatever else you like. Personally I always use ... and exists (select 1 ...
EXISTS is a type of subquery which can only return a boolean value based upon whether any rows are returned by the subquery. Selecting 1, or *, or whatever else doesn't matter within this context because the result is always just true or false.
You can verify this by testing that these two statements produce the exact same plan.
select 1 where exists(select * from sys.objects)
select 1 where exists(select 1 from sys.objects)
What you select in your outer query DOES matter. As you found, these two statements produce very different execution plans:
select top 1 * from sys.objects
select top 1 1 from sys.objects
The first one will be slower because it has to actually return real data, in this case by joining the three underlying tables: syspalnames, syssingleobjrefs, and sysschobjs.
As to the preference of what you put inside your EXISTS subqueries - SELECT 1 or SELECT * - it doesn't matter. I usually say SELECT 1, but SELECT * is just as good and you'll see it in a lot of Microsoft documentation.
I was looking for an answer to just the actual question contained in the title. I found it at this link:
Select Top 1 or Top n basically returns the first n rows of data based
on the sql query. Select Top 1 1 or Top n s will return the first n
rows with data s depending on the sql query.
For example, the query below produces the first name and last name of
the first 10 matches. This query will return first name and last name
only.
SELECT TOP 10 FirstName, LastName
FROM [tblUser]
where EmailAddress like 'john%'
Now, look at this query with select top 10 'test' - this will produce
the same number of rows as in the previous query (same database, same
condition) but the values will be 'test'.
SELECT TOP 10 'test'
FROM [tblUser]
where EmailAddress like 'john%'
So, select TOP 1 * returns the first row, while select TOP 1 1 returns one row containing just "1". This holds if the query matches at least one row; otherwise both queries return an empty result set.
As an additional example, this:
SELECT TOP 10 'test', FirstName
FROM [tblUser]
where EmailAddress like 'john%'
will return a table containing a column filled with "test" and another column filled with the first name of the first 10 matches of the query.

SQL select after where clause

Here is the setup:
Table 1: table_1
column_id
column_12
column_13
column_14
Table 2: table_2
column_id
column_21
column_22
Select statement:
DECLARE @Variable INT;
SET @Variable = 300;
SELECT b.column_id,
b.column_12,
SUM(b.column_13) OVER (PARTITION BY b.column_id ORDER BY b.column_12) AS sum_column_13,
@Variable / nullif(SUM(b.column_13) OVER (PARTITION BY b.column_id ORDER BY b.column_12),0) AS divide_var,
(b.column_13*100) / nullif(b.column_14,0) AS divide_column_3
FROM dbo.table_1 b
WHERE b.column_12 IN ('AM','AJ','A-M','A-J','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q');
This works great, all the formulas are working and the correct results are shown.
b.column_id is retrieved
b.column_12 is retrieved
sum_column_13 is equal to the sum of all the column_13 values (partitioned by column_id)
divide_var is equal to the variable divided by sum_column_13
divide_column_3 is equal to column_13 (times 100) divided by column_14
Now, however, I am trying to retrieve @Variable from table_2, instead of it being static.
Both tables have a column_id, which could link them together. However, this value is not unique.
The actual number for @Variable should come from table_2, by summing all the values of column_21 for each column_id (something similar to sum_column_13).
I can make both things work separately, but when I try to combine them (with a JOIN, or an extra SELECT clause) everything goes wild. For example, when using the JOIN statement, the WHERE clause is applied solely to the JOIN statement and not to the SELECT statement. How I imagine it should go is to use the column_id results from the current SELECT, then use this to retrieve the required data from table_2.
I understand my explanation is not very clear. So here is an SQLFiddle.
As you can see the variable right now comes from adding up the two values in table_2.
Hope this helps.
Thanks,
Here is the sample code. I have not used a variable; instead I use the sum of the column directly, computed in a CTE:
with tbl_2 (col_id, col_sum) as
( select column_id, sum(column_21) from dbo.table_2 group by column_id )
SELECT b.column_id,
b.column_12,
SUM(b.column_13) OVER (PARTITION BY b.column_id ORDER BY b.column_12) AS sum_column_13,
tbl_2.col_sum / nullif(SUM(b.column_13) OVER (PARTITION BY b.column_id ORDER BY b.column_12),0) AS divide_var,
(b.column_13*100) / nullif(b.column_14,0) AS divide_column_3
FROM dbo.table_1 b
join tbl_2 on b.column_id = tbl_2.col_id
WHERE b.column_12 IN ('AM','AJ','A-M','A-J','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q');

Does WITH statement execute once per query or once per row?

My understanding of the WITH statement (CTE) is that it executes once per query. With a query like this:
WITH Query1 AS ( ... )
SELECT *
FROM
SomeTable t1
LEFT JOIN Query1 t2 ON ...
If this results in 100 rows, I expect that Query1 was executed only once - not 100 times. If that assumption is correct, the time taken to run the entire query is roughly equal to the time taken to: run Query1 + select from SomeTable + join SomeTable to Query1.
I am in a situation where:
Query1 when run alone takes ~5 seconds (400k rows).
The remainder of the query, after removing the WITH statement and the LEFT JOIN takes ~15 seconds (400k rows).
So, when running the entire query with the WITH statement and the LEFT JOIN in place, I would have expected the query to complete in a timely manner. Instead, I let it run for over an hour and, when I stopped it, it had only got as far as 11k rows.
I am clearly wrong, but why?
Example:
SET NOCOUNT ON;
SET IMPLICIT_TRANSACTIONS ON;
CREATE TABLE MyTable (MyID INT PRIMARY KEY);
GO
INSERT MyTable (MyID)
VALUES (11), (22), (33), (44), (55);
PRINT 'Test MyCTE:';
WITH MyCTE
AS (
SELECT *, ROW_NUMBER()OVER(ORDER BY MyID) AS RowNum
FROM MyTable
)
SELECT *
FROM MyCTE crt
LEFT JOIN MyCTE prev ON crt.RowNum=prev.RowNum+1;
ROLLBACK;
If you run the previous script in SSMS (press Ctrl+M -> Actual Execution Plan) then you will get this execution plan for the last query:
In this case, the CTE is executed one time for crt alias and five (!) times for prev alias, once for every row from crt.
So, the answer for this question
Does WITH statement execute once per query or once per row?
is both: once per query (crt) and once per row (prev: once for every row from crt).
To optimize this query, as a start:
1) You can try to store the results from the CTE (MyCTE or Query1) in a table variable or a temp table,
2) Define the primary key of this table as the join column(s),
3) Rewrite the final query to use this table variable or temp table.
Of course, you can also try to rewrite the final query without this self join on the CTE.
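The three steps above can be sketched as follows, reusing MyTable and the RowNum column from the example script. The temp table name #MyCTE is an arbitrary choice, and this is one possible layout rather than the only one.

```sql
-- 1) + 2) Materialize the CTE once, keyed on the join column.
CREATE TABLE #MyCTE (
    RowNum BIGINT NOT NULL PRIMARY KEY,
    MyID   INT    NOT NULL
);

INSERT INTO #MyCTE (RowNum, MyID)
SELECT ROW_NUMBER() OVER (ORDER BY MyID), MyID
FROM MyTable;

-- 3) The final query now joins the stored rows instead of
--    re-evaluating the CTE for each row of crt.
SELECT crt.MyID, crt.RowNum, prev.MyID AS PrevMyID
FROM #MyCTE crt
LEFT JOIN #MyCTE prev ON crt.RowNum = prev.RowNum + 1;

DROP TABLE #MyCTE;
```

For this particular "previous row" pattern, SQL Server 2012 and later also offer LAG(MyID) OVER (ORDER BY MyID), which removes the self join altogether.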
