Getting non-deterministic results from WITH RECURSIVE cte - snowflake-cloud-data-platform

I'm trying to create a recursive CTE that traverses all the records for a given ID, and does some operations between ordered records. Let's say I have customers at a bank who get charged a uniquely identifiable fee, and a customer can pay that fee in any number of installments:
WITH recursive payments (
id
, index
, fees_paid
, fees_owed
)
AS (
SELECT id
, index
, fees_paid
, fee_charged
FROM table
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, t.fees_paid
, p.fees_owed - p.fees_paid
FROM table t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
The join logic seems sound, but when I join the output of this query to the source table, I'm getting non-deterministic and incorrect results.
This is my first foray into Snowflake's recursive CTEs. What am I missing in the intermediate result logic that is leading to the non-determinism here?

I assume this is edited code, because in the anchor of your CTE you select a fourth column, fee_charged, which does not exist, and in the recursion you don't sum the fees paid, among other things. Basically, your logic seems rather strange.
So creating some random data, that has two different id streams to recurse over:
create or replace table data (id number, index number, val text);
insert into data
select * from values (1,1,'a'),(2,1,'b')
,(1,2,'c'), (2,2,'d')
,(1,3,'e'), (2,3,'f')
v(id, index, val);
Now altering your CTE just a little bit to concatenate the strings together:
WITH RECURSIVE payments AS
(
SELECT id
, index
, val
FROM data
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, p.val || t.val as val
FROM data t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
we get:
ID INDEX VAL
1 1 a
1 2 ac
1 3 ace
2 1 b
2 2 bd
2 3 bdf
Which is exactly what I would expect. So as for how this relates to your "it gets strange when I join to other stuff": either the output of your CTE is not what you expect it to be, or your join to the other table is not working as you expect, or there is a bug in Snowflake.
Which all comes down to this: if the CTE results are exactly what you expect, materialize them into a table and join that to your other table, to rule out some form of CTE vs JOIN bug and to debug why your join is not working.
But if your CTE output is not what you expect, then let's help debug that.
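For comparison, here is a minimal sketch of what a running "fees still owed" recursion might look like. It runs on Python's built-in sqlite3 module rather than Snowflake, so the table name fees, the column idx (renamed because INDEX is a keyword in SQLite), and the sample amounts are all assumptions made for illustration. It also assumes (id, idx) is unique, since duplicate keys would multiply rows at each recursion step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: two customers paying a fee in installments.
conn.executescript("""
CREATE TABLE fees (id INTEGER, idx INTEGER, fees_paid INTEGER, fee_charged INTEGER);
INSERT INTO fees VALUES
  (1, 1, 30, 100), (1, 2, 50, 100), (1, 3, 20, 100),
  (2, 1, 10,  40), (2, 2, 30,  40);
""")

rows = conn.execute("""
WITH RECURSIVE payments (id, idx, fees_paid, fees_owed) AS (
    -- anchor: first installment, owed = charge minus that payment
    SELECT id, idx, fees_paid, fee_charged - fees_paid
    FROM fees
    WHERE idx = 1
    UNION ALL
    -- recursive step: subtract the current row's payment from the previous balance
    SELECT t.id, t.idx, t.fees_paid, p.fees_owed - t.fees_paid
    FROM fees t
    JOIN payments p
      ON t.id = p.id
     AND t.idx = p.idx + 1
)
SELECT id, idx, fees_paid, fees_owed
FROM payments
ORDER BY id, idx
""").fetchall()

for row in rows:
    print(row)
```

Note that the recursive step subtracts t.fees_paid (the current installment), not p.fees_paid, so the balance carried forward already reflects every earlier payment.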

Related

How does order by work when all column values are identical?

I use SQL Server 2016. Below are the rows in the table test_account. You can see that the values of updDtm and fileCreatedTime are identical. id is the primary key.
id accno updDtm fileCreatedTime
-----------------------------------------------------------------------
1 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
2 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
3 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
I want to query the latest account id whose accno is 123456789, ordered by updDtm and fileCreatedTime.
I ran the following SQL, and the output is id = 1:
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1
My question is: is the query result repeatable and reliable (always id = 1, whether it is run once or many times) when the values of updDtm and fileCreatedTime are identical, or is the returned id arbitrary?
I have read some articles and learned that for MySQL and Oracle the query result is not reliable or reproducible. How about SQL Server?
The context of this documentation reference is ORDER BY usage with OFFSET and FETCH but the same considerations apply to all ORDER BY usage, including windowing functions like ROW_NUMBER(). In summary,
To achieve stable results between query requests, the following conditions must be met:
The underlying data that is used by the query must not change.
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
I'm trying to find a case to test whether the query would output a result other than id=1, but with no luck
The ordering of rows when duplicate ORDER BY values exist is undefined (a.k.a. non-deterministic and arbitrary) because it depends on the execution plan (which may vary due to available indexes, stats, and the optimizer), parallelism, database engine internals, and even physical data storage. The example below yields different results due to a parallel plan on my test instance.
DROP TABLE IF EXISTS dbo.test_account;
CREATE TABLE dbo.test_account(
id int NOT NULL
CONSTRAINT pk_test_account PRIMARY KEY CLUSTERED
, accno int NOT NULL
, updDtm datetime2 NOT NULL
, fileCreatedTime datetime2 NOT NULL
);
--insert 100K rows
WITH
t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
,t1k AS (SELECT 0 AS n FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c)
,t1g AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t1k AS a CROSS JOIN t1k AS b CROSS JOIN t1k AS c)
INSERT INTO dbo.test_account (id, accno, updDtm, fileCreatedTime)
SELECT num, 123456789, '2022-07-27 09:41:10.0000000', '2022-07-27 11:33:33.8300000'
FROM t1g
WHERE num <= 100000;
GO
--run query 10 times
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
Example results:
1
27001
25945
57071
62813
1
1
1
36450
78805
The simple solution is to add the primary key as the last column to the ORDER BY clause to break ties. This returns the same id value (1) in every iteration regardless of the execution plan and indexes.
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC, a.id) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
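As a small, portable sketch of the tie-breaker idea, the snippet below runs the same kind of query through Python's built-in sqlite3 module. SQLite is only a stand-in for SQL Server here, and the three sample rows are inserted out of order on purpose. It does not attempt to reproduce the instability of the untied version; it only shows that once the unique id is the last ORDER BY key, the first row is fully determined by the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_account ("
    "id INTEGER PRIMARY KEY, accno INTEGER, updDtm TEXT, fileCreatedTime TEXT)"
)
# Three rows with identical timestamps, inserted in a scrambled id order.
conn.executemany(
    "INSERT INTO test_account VALUES (?, ?, ?, ?)",
    [(i, 123456789, "2022-07-27 09:41:10", "2022-07-27 11:33:33.83") for i in (3, 1, 2)],
)

query = """
SELECT t.id
FROM (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY a.accno
               ORDER BY a.updDtm DESC, a.fileCreatedTime DESC, a.id  -- id breaks ties
           ) AS seq,
           a.id
    FROM test_account a
) AS t
WHERE t.seq = 1
"""

# Run the query several times; the unique tie-breaker pins the answer.
first_ids = [conn.execute(query).fetchone()[0] for _ in range(10)]
print(first_ids)
```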
On a side note, this index will optimize the query:
CREATE NONCLUSTERED INDEX idx ON dbo.test_account(accno, updDtm DESC, fileCreatedTime DESC, id);

Selecting changes in an employees details

I have a table in SQL Server where users are allowed to make changes to an employee's details. Every time a change is made, a new record is placed in the EMPLOYEE_HIST table. Only the EMP_ID is kept constant for the employee, and all other details are modifiable.
There is also a SEQ_NO column, which maintains the sequence of entries made.
EMPLOYEE_HIST:
SEQ_NO EMP_ID SOME_VAL1 SOME_VAL2
1 E1 V11 V21 (initial value of this employee)
2 E2 V12 V22 (initial value of this employee)
3 E3 V13 V23 (initial value of this employee)
4 E2 V00 V22
5 E1 V01 V21
6 E2 V02 V22
7 E4 V00 V00 (initial value of this employee)
I want a query which will give me changes made to particular employees, something like
EMP_ID SOME_VAL1_OLD SOME_VAL1_NEW SOME_VAL2_OLD SOME_VAL2_NEW
E1 V11 V01 V21 V21
E2 V12 V00 V22 V22
E2 V00 V02 V22 V22
UPDATE
Also employee details may be modified by user n number of times and for each change, a row should be present in the result set.
Please help.
EDIT:
I finally settled with using LAG function. It will work like this:
SELECT *,ROW_NUMBER() OVER(PARTITION BY EMP_ID,CHANGE_NO ORDER BY EMP_ID,CHANGE_NO,SEQ_NO) AS RN
FROM(
SELECT * FROM ( SELECT EMP_ID, SEQ_NO, LAG(SOME_VAL1)
OVER(PARTITION BY EMP_ID ORDER BY EMP_ID,SEQ_NO) AS OLD_VAL, SOME_VAL1 AS NEW_VAL, '1' AS CHANGE_NO FROM EMPLOYEE_HIST) T
WHERE OLD_VAL<>NEW_VAL UNION ALL
SELECT * FROM ( SELECT EMP_ID, SEQ_NO, LAG(SOME_VAL2) OVER(PARTITION BY EMP_ID ORDER BY EMP_ID,SEQ_NO) AS OLD_VAL, SOME_VAL2 AS NEW_VAL, '2' AS CHANGE_NO FROM EMPLOYEE_HIST) T
WHERE OLD_VAL<>NEW_VAL) TEMP
But the performance is terribly slow for fetching total 500 rows on the table containing 3 million records. Please give some suggestions to improve sorting cost.
You can use a CTE with a Window function if you're using 2008 or newer:
;WITH r AS (
SELECT RANK() OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO DESC) [rank]
, EMP_ID
, SOME_VAL1
, SOME_VAL2
FROM EMPLOYEE_HIST
)
SELECT e.EMP_ID
, s2.SOME_VAL1 [SOME_VAL1_OLD]
, s1.SOME_VAL1 [SOME_VAL1_NEW]
, s2.SOME_VAL2 [SOME_VAL2_OLD]
, s1.SOME_VAL2 [SOME_VAL2_NEW]
FROM (SELECT DISTINCT EMP_ID FROM EMPLOYEE_HIST) AS e
LEFT JOIN r AS s1 ON e.EMP_ID = s1.EMP_ID and s1.rank = 1 --the last change
LEFT JOIN r AS s2 ON e.EMP_ID = s2.EMP_ID and s2.rank = 2 --the second to last change
If you want all of the changes, not just the top two, then you should be able to do something like this:
;WITH r AS (
SELECT RANK() OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO DESC) [rank]
, EMP_ID
, SOME_VAL1
, SOME_VAL2
FROM EMPLOYEE_HIST
)
SELECT e.EMP_ID
, s2.SOME_VAL1 [SOME_VAL1_OLD]
, s1.SOME_VAL1 [SOME_VAL1_NEW]
, s2.SOME_VAL2 [SOME_VAL2_OLD]
, s1.SOME_VAL2 [SOME_VAL2_NEW]
FROM (SELECT DISTINCT EMP_ID FROM EMPLOYEE_HIST) AS e
LEFT JOIN (r AS s1 --the change
INNER JOIN r AS s2 ON s1.EMP_ID = s2.EMP_ID and s2.rank = s1.rank + 1) --previous value
ON e.EMP_ID = s1.EMP_ID
This should enumerate all changes until it encounters the original value.
You could use a CTE to get a partitioned row number, by EMP_ID. Then join that against itself where the row number is offset by 1.
;WITH PartitionedRows
AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY EMP_ID ORDER BY SEQ_NO) AS RowID, EMP_ID, SOME_VAL1,SOME_VAL2
FROM EMPLOYEE_HIST
)
SELECT a.EMP_ID,b.SOME_VAL1 AS SOME_VAL1_OLD,a.SOME_VAL1 AS SOME_VAL1_NEW,b.SOME_VAL2 AS SOME_VAL2_OLD,a.SOME_VAL2 AS SOME_VAL2_NEW
FROM PartitionedRows a
LEFT JOIN PartitionedRows b ON a.EMP_ID = b.EMP_ID AND a.RowID = (b.RowID + 1)
WHERE b.RowID IS NOT NULL
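The same old/new pairing can also be sketched with LAG in a single pass, without a self-join. The example below uses Python's built-in sqlite3 module as a stand-in for SQL Server (LAG needs SQLite 3.25 or newer, and SQL Server 2012 or newer) and loads the sample rows from the question. The helper column PREV_SEQ is an illustrative name, used only to drop each employee's initial row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMPLOYEE_HIST (
    SEQ_NO INTEGER PRIMARY KEY, EMP_ID TEXT, SOME_VAL1 TEXT, SOME_VAL2 TEXT);
INSERT INTO EMPLOYEE_HIST VALUES
  (1,'E1','V11','V21'), (2,'E2','V12','V22'), (3,'E3','V13','V23'),
  (4,'E2','V00','V22'), (5,'E1','V01','V21'), (6,'E2','V02','V22'),
  (7,'E4','V00','V00');
""")

changes = conn.execute("""
WITH h AS (
    SELECT SEQ_NO, EMP_ID,
           -- previous row of the same employee, in SEQ_NO order
           LAG(SEQ_NO)    OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO) AS PREV_SEQ,
           LAG(SOME_VAL1) OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO) AS SOME_VAL1_OLD,
           SOME_VAL1 AS SOME_VAL1_NEW,
           LAG(SOME_VAL2) OVER (PARTITION BY EMP_ID ORDER BY SEQ_NO) AS SOME_VAL2_OLD,
           SOME_VAL2 AS SOME_VAL2_NEW
    FROM EMPLOYEE_HIST
)
SELECT EMP_ID, SOME_VAL1_OLD, SOME_VAL1_NEW, SOME_VAL2_OLD, SOME_VAL2_NEW
FROM h
WHERE PREV_SEQ IS NOT NULL      -- skip each employee's initial row
ORDER BY EMP_ID, SEQ_NO
""").fetchall()

for row in changes:
    print(row)
```

This yields one row per modification, which matches the result set requested in the question: one row for E1 and two for E2.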
You may be better off with a different data model. You could have a table EMPLOYEE_HIST_OLD with the identical structure. This would let you archive the former data (even with a timestamp and/or sequence number) and keep the EMPLOYEE_HIST table smaller, free of data you would not reference regularly. It would also allow a basic join between the two tables.
I would then suggest you use the timestamp of the EMPLOYEE_HIST_OLD records to find the most recent modifications, then join those records back to the current records. This will present only the changed records. You could limit the query on EMPLOYEE_HIST_OLD to return just one record (the most recent) if you like; see "SQL query to get most recent row for each instance of a given key".
If you must stay within the same EMPLOYEE_HIST table for everything and use the sequence number approach you may wish to use a count() to find changed records for a particular Employee ID and return the values ORDERED by sequence number. You could also limit the query to employees with count > 1. You would then view the data vertically in the table, though. To parse the values into separate columns like VAR1_OLD and VAR1 essentially would require you to only read the last two values and make one record out of two. You lose the visibility of all the changes when trying to view the data horizontally. There could be more than one historical change. To view the records horizontally would require you to do some array manipulation outside of SQL after the data was returned from the query.
For info on counting:
SQL query for finding records where count > 1

SQL Join one-to-many tables, selecting only most recent entries

This is my first post - so I apologise if it's in the wrong section!
I'm joining two tables with a one-to-many relationship using their respective ID numbers: but I only want to return the most recent record for the joined table and I'm not entirely sure where to even start!
My original code for returning everything is shown below:
SELECT table_DATES.[date-ID], *
FROM table_CORE LEFT JOIN table_DATES ON [table_CORE].[core-ID] = table_DATES.[date-ID]
WHERE table_CORE.[core-ID] Like '*'
ORDER BY [table_CORE].[core-ID], [table_DATES].[iteration];
This returns a group of records: showing every matching ID between table_CORE and table_DATES:
table_CORE date-ID iteration
1 1 1
1 1 2
1 1 3
2 2 1
2 2 2
3 3 1
4 4 1
But I need to return only the date with the maximum value in the "iteration" field as shown below
table_CORE date-ID iteration Additional data
1 1 3 MoreInfo
2 2 2 MoreInfo
3 3 1 MoreInfo
4 4 1 MoreInfo
I really don't even know where to start - obviously it's going to be a JOIN query of some sort - but I'm not sure how to get the subquery to return only the highest iteration for each item in table 2's ID field?
Hope that makes sense - I'll reword if it comes to it!
--edit--
I'm wondering how to integrate that when I'm needing all the fields from table 1 (table_CORE in this case) and all the fields from table2 (table_DATES) joined as well?
Both tables have additional fields that will need to be merged.
I'm pretty sure I can just add the fields into the "SELECT" and "GROUP BY" clauses, but there are around 40 fields altogether (and typing all of them will be tedious!)
Try using the MAX aggregate function like this with a GROUP BY clause.
SELECT
[ID1],
[ID2],
MAX([iteration])
FROM
table_CORE
LEFT JOIN table_DATES
ON [table_CORE].[core-ID] = table_DATES.[date-ID]
WHERE
table_CORE.[core-ID] Like '*' --LIKE '%something%' ??
GROUP BY
[ID1],
[ID2]
Your example field names don't match your sample query so I'm guessing a little bit.
Just to make sure that I have everything you’re asking for right, I am going to restate some of your question and then answer it.
Your source tables are table_core and table_dates, and the current and desired outputs are the two result sets shown in your question.
In order to make that happen all you need to do is use a subquery (or a CTE) as a “cross-reference” table. (I used temp tables to recreate your data example and _ in place of the - in your column names).
--Loading the example data
create table #table_core
(
core_id int not null
)
create table #table_dates
(
date_id int not null
, iteration int not null
, additional_data varchar(25) null
)
insert into #table_core values (1), (2), (3), (4)
insert into #table_dates values (1,1, 'More Info 1'),(1,2, 'More Info 2'),(1,3, 'More Info 3'),(2,1, 'More Info 4'),(2,2, 'More Info 5'),(3,1, 'More Info 6'),(4,1, 'More Info 7')
--select query needed for desired output (using a CTE)
; with iter_max as
(
select td.date_id
, max(td.iteration) as iteration_max
from #table_dates as td
group by td.date_id
)
select tc.*
, td.*
from #table_core as tc
left join iter_max as im on tc.core_id = im.date_id
inner join #table_dates as td on im.date_id = td.date_id
and im.iteration_max = td.iteration
Alternatively, number the rows per core-ID with row_number() and keep only the first one:
select *
from
(
SELECT table_CORE.*, table_DATES.*
, row_number() over (partition by table_CORE.[core-ID] order by table_DATES.iteration desc) as rn
FROM table_CORE
LEFT JOIN table_DATES
ON [table_CORE].[core-ID] = table_DATES.[date-ID]
WHERE table_CORE.[core-ID] Like '*'
) tt
where tt.rn = 1
ORDER BY [core-ID]
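For engines without window functions, the MAX-per-group join can be written with a plain derived table. The sketch below runs it through Python's built-in sqlite3 module, with the sample data from the temp-table answer above (underscores in place of hyphens). Using LEFT JOIN for both joins is a choice made here, so that a table_core row with no dates would still appear; the answers above do not state that requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_core  (core_id INTEGER PRIMARY KEY);
CREATE TABLE table_dates (date_id INTEGER, iteration INTEGER, additional_data TEXT);
INSERT INTO table_core VALUES (1), (2), (3), (4);
INSERT INTO table_dates VALUES
  (1,1,'More Info 1'), (1,2,'More Info 2'), (1,3,'More Info 3'),
  (2,1,'More Info 4'), (2,2,'More Info 5'),
  (3,1,'More Info 6'), (4,1,'More Info 7');
""")

latest = conn.execute("""
SELECT tc.core_id, td.date_id, td.iteration, td.additional_data
FROM table_core AS tc
LEFT JOIN (
    -- highest iteration per date_id acts as a cross-reference
    SELECT date_id, MAX(iteration) AS iteration_max
    FROM table_dates
    GROUP BY date_id
) AS im
  ON tc.core_id = im.date_id
LEFT JOIN table_dates AS td
  ON td.date_id = im.date_id
 AND td.iteration = im.iteration_max
ORDER BY tc.core_id
""").fetchall()

for row in latest:
    print(row)
```

Because the grouping happens only inside the derived table, the outer SELECT can list (or star-expand) as many columns from both tables as needed without touching a GROUP BY.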

SQL: Is this JOIN over a player table and a subquery-created high score table the best way to perform this operation?

I have a query that I believe (not fully tested, only sanity-checked) works, but it seems incredibly overwrought to me. It's a join over three tables.
One table is a quest table with questID, playerID, and score. Another table is a player table that contains the territory the player is in. A third table is not a real table; it's a subquery that produces a temporary result that consists of the highest scoring player for each territory.
SELECT
Q1.QuestID, Q1.PersonID, C1.TerritoryID
FROM
[dbo].[T_QuestSys_Quest] AS Q1
JOIN
[GZ_GAME].[dbo].[T_Character] AS C1 ON Q1.PersonID = C1.PersonID
JOIN
(SELECT
TerritoryID, MAX(Score) AS HighScore
FROM
[GZ_GAME].[dbo].[T_QuestSys_Quest] AS Q2
JOIN
[GZ_GAME].[dbo].[T_Character] AS C2 ON Q2.PersonID = C2.PersonID
AND Q2.QuestID = @QuestID
GROUP BY
TerritoryID) AS S ON Q1.Score = S.HighScore
AND C1.TerritoryID = S.TerritoryID
AND Q1.QuestID = @QuestID
Notably, the subquery would not allow me to add additional columns to the SELECT list; doing so resulted in an error:
"... is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause".
It seemed to me that the MAX(Score) would be sufficiently disambiguating as to which PersonID I wanted, but I guess not.
Anyway, my questions are: Is there a better way to do this in terms of elegance/simplicity? Is there a better way to do this in terms of performance?
One method is to use the ROW_NUMBER window function:
;WITH cte
AS (SELECT Q2.questid, Q2.personid, C2.territoryid, Q2.score,
Row_number() OVER(partition BY C2.territoryid ORDER BY Q2.score DESC) RN
FROM [GZ_GAME].[dbo].[t_questsys_quest] AS Q2
JOIN [GZ_GAME].[dbo].[t_character] AS C2
ON Q2.personid = C2.personid
AND Q2.questid = @QuestID)
SELECT *
FROM cte
WHERE rn = 1
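One difference worth seeing in a runnable form: ROW_NUMBER keeps exactly one player per territory even when scores tie, whereas the original MAX join returns every tied player. The sketch below uses RANK instead, which keeps the ties. It runs on Python's built-in sqlite3 module; the table names quest and player and the sample scores are hypothetical stand-ins for the real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quest  (QuestID INTEGER, PersonID INTEGER, Score INTEGER);
CREATE TABLE player (PersonID INTEGER PRIMARY KEY, TerritoryID INTEGER);
INSERT INTO player VALUES (1, 10), (2, 10), (3, 20), (4, 20);
-- players 3 and 4 tie in territory 20; quest 8 must be ignored
INSERT INTO quest VALUES (7, 1, 50), (7, 2, 80), (7, 3, 90), (7, 4, 90), (8, 1, 999);
""")

top = conn.execute("""
WITH ranked AS (
    SELECT q.QuestID, q.PersonID, p.TerritoryID, q.Score,
           RANK() OVER (PARTITION BY p.TerritoryID ORDER BY q.Score DESC) AS rnk
    FROM quest q
    JOIN player p ON q.PersonID = p.PersonID
    WHERE q.QuestID = ?
)
SELECT QuestID, PersonID, TerritoryID
FROM ranked
WHERE rnk = 1
ORDER BY TerritoryID, PersonID
""", (7,)).fetchall()

print(top)
```

If exactly one winner per territory is wanted instead, switching back to ROW_NUMBER and adding a unique tie-breaker such as PersonID to the ORDER BY makes the pick repeatable.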

How to develop a recursive CTE in T-SQL?

I am new to recursive CTEs. I am trying to develop a CTE which will return all of the employees under each manager name. So I have two tables: people_rv and staff_rv
People_rv table contains all of the people, both managers and employees. Staff_rv only contains manager information. Uniqueidentifier staff values are stored in Staff_rv. Uniqueidentifier employee values are stored in people_rv. People_rv contains varchar first and last name values for both managers and employees.
But when I run the following CTE I get an error:
WITH
cteStaff (ClientID, FirstName, LastName, SupervisorID, EmpLevel)
AS
(
SELECT p.people_id, p.first_name, p.last_name, s.supervisor_id,1
FROM people_rv p JOIN staff_rv s on s.people_id = p.people_id
WHERE s.supervisor_id = '95E16819-8C3A-4098-9430-08F0E3B764E1'
UNION ALL
SELECT p2.people_id, p2.first_name, p2.last_name, s2.supervisor_id, r.EmpLevel + 1
FROM people_rv p2 JOIN staff_rv s2 on s2.people_id = p2.people_id
INNER JOIN cteStaff r on s2.staff_id = r.ClientID
)
SELECT
FirstName + ' ' + LastName AS FullName,
EmpLevel,
(SELECT first_name + ' ' + last_name FROM people_rv p join staff_rv s on s.people_id = p.people_id
WHERE s.staff_id = cteStaff.SupervisorID) AS Manager
FROM cteStaff
OPTION (MAXRECURSION 0);
My output is:
Barbara G 1 Melanie K
Dawn P 1 Melanie K
Garrett M 1 Melanie K
Stephanie P 1 Melanie K
Amanda F 1 Melanie K
Amanda T 1 Melanie K
Stephanie G 1 Melanie K
Carlos H 1 Melanie K
So it is not iterating any more than the first level. Why not?
Melanie is the top most supervisor, but each of the persons in the leftmost column are also supervisors. So this query should also return level 2.
You may be in an infinite loop with your join. I would check how many levels you expect the table to actually go down. Generally you join a recursion on something similar to
ID = ParentID
where both sides are either contained in a table or computed in an expression. Keep in mind you can also create a CTE prior to a recursive CTE if you have to make up your relationship.
Here is an example that will self execute, it may help.
Declare @Table table ( PersonId int identity, PersonName varchar(512), Account int, ParentId int, Orders int);
insert into @Table values ('Brett', 1, NULL, 1000),('John', 1, 1, 100),('James', 1, 1, 200),('Beth', 1, 2, 300),('John2', 2, 4, 400);
select
PersonID
, PersonName
, Account
, ParentID
from @Table
; with recursion as
(
select
t1.PersonID
, t1.PersonName
, t1.Account
--, t1.ParentID
, cast(isnull(t2.PersonName, '')
+ Case when t2.PersonName is not null then '\' + t1.PersonName else t1.PersonName end
as varchar(255)) as fullheirarchy
, 1 as pos
, cast(t1.orders +
isnull(t2.orders,0) -- if the parent has no orders then zero
as int) as Orders
from @Table t1
left join @Table t2 on t1.ParentId = t2.PersonId
union all
select
t.PersonID
, t.PersonName
, t.Account
--, t.ParentID
, cast(r.fullheirarchy + '\' + t.PersonName as varchar(255))
, pos + 1 -- increases
, r.orders + t.orders
from @Table t
join recursion r on t.ParentId = r.PersonId
)
, b as
(
select *, max(pos) over(partition by PersonID) as maxrec -- I find the maximum occurrence of position by person
from recursion
)
select *
from b
where pos = maxrec -- finds the furthest down tree
-- and Account = 2 -- I could find just someone from a different department
Your problem, as far as I can tell, is that you have no join connecting managers to their employees.
This join
INNER JOIN cteStaff r on r.StaffID = s2.staff_id
Just joins the same initial level 1 staffer back to himself.
UPDATE:
Still not quite right! You have a supervisor_id, but again you're still not actually using that to join back to the CTE.
So for each recursion of this CTE you need to (excluding the name join):
select {Level 1 Boss}, NULL (no supervisor)
union
select {new employee}, {that employee's boss}
So the join must connect the CTE's ClientID (the level 1 boss) to the second UNION query's supervisor field, which looks to be supervisor_id , not staff_id.
The JOIN to accomplish this second task is (from what I can tell of your staff_rv table schema):
SELECT p2.people_id, p2.first_name, p2.last_name, s2.supervisor_id, r.EmpLevel + 1
FROM people_rv p2 JOIN staff_rv s2 on s2.people_id = p2.people_id
INNER JOIN cteStaff r on s2.supervisor_id = r.ClientID
Note the bottom join joins the r.ClientID (the level 1 boss) to the staffer's supervisor_id field.
(NB: I think your staff_id and supervisor_id's mimic your people_id values from the people_rv table, so this join should work fine. But if they are different (i.e. a staffer's supervisor_id isn't that supervisor's people_id) then you'll need to write the join such that the staffer's supervisor_id can be joined to their people_id you're storing as ClientID in the CTE.)
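To make the join direction concrete, here is a minimal sketch run through Python's built-in sqlite3 module. It collapses people_rv and staff_rv into a single hypothetical staff table with integer ids, and it assumes that supervisor_id holds the supervisor's people_id, as discussed in the note above. The names Erin and Frank are invented to give the hierarchy a second and third level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff (people_id INTEGER PRIMARY KEY, name TEXT, supervisor_id INTEGER);
INSERT INTO staff VALUES
  (1, 'Melanie', NULL),   -- top-most supervisor
  (2, 'Barbara', 1),
  (3, 'Dawn',    1),
  (4, 'Erin',    2),      -- reports to Barbara
  (5, 'Frank',   4);      -- reports to Erin
""")

levels = conn.execute("""
WITH RECURSIVE cteStaff (people_id, name, supervisor_id, EmpLevel) AS (
    -- anchor: direct reports of the top-most supervisor
    SELECT people_id, name, supervisor_id, 1
    FROM staff
    WHERE supervisor_id = 1
    UNION ALL
    -- recursive step: the new row's supervisor is a person already in the CTE
    SELECT s.people_id, s.name, s.supervisor_id, r.EmpLevel + 1
    FROM staff s
    JOIN cteStaff r ON s.supervisor_id = r.people_id
)
SELECT name, EmpLevel
FROM cteStaff
ORDER BY EmpLevel, name
""").fetchall()

for row in levels:
    print(row)
```

Joining on the row's own id instead of its supervisor_id would match each level 1 row only to itself, which is why the original query never got past level 1.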
Here's a good simple Recursive CTE to review (it may not be the answer, but someone else searching on how to make a recursive CTE may need it):
-- Recursive CTE
;
WITH Years ( myYear )
AS (
-- Base case
SELECT DATEPART(year, GETDATE())
UNION ALL
-- Recursive
SELECT Years.myYear - 1
FROM Years
WHERE Years.myYear >= 2002
)
SELECT *
FROM Years
Note that this probably won't solve your problem, but is a means to hopefully seeing where you're going wrong in the original query.
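The termination condition in that Years example is easy to misread, so here is a tiny sketch of it using Python's built-in sqlite3 module. The fixed start year 2010 replaces DATEPART(year, GETDATE()) so the output is reproducible; that substitution is an assumption for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

years = [row[0] for row in conn.execute("""
WITH RECURSIVE Years (myYear) AS (
    -- base case: fixed start year instead of the current year
    SELECT 2010
    UNION ALL
    -- recursive step: runs while the previous year is >= 2002
    SELECT myYear - 1
    FROM Years
    WHERE myYear >= 2002
)
SELECT myYear FROM Years ORDER BY myYear DESC
""")]

print(years)
```

The WHERE clause filters the row being recursed from, not the row being produced, so the last year emitted is 2001 rather than 2002.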
The default is 100 levels of recursion - you can set it to unlimited by using the MAXRECURSION query hint where you're selecting from your CTE:
...
FROM cteStaff
OPTION (MAXRECURSION 0);
From MSDN:
MAXRECURSION number
Specifies the maximum number of recursions allowed for this query. number is a nonnegative integer between 0 and 32767. When 0 is specified, no limit is applied. If this option is not specified, the default limit for the server is 100.
When the specified or default number for MAXRECURSION limit is reached during query execution, the query is ended and an error is returned.
Because of this error, all effects of the statement are rolled back. If the statement is a SELECT statement, partial results or no results may be returned. Any partial results returned may not include all rows on recursion levels beyond the specified maximum recursion level.
