Need help writing a loop in Snowflake

I have data like below in a table; the component part no shows the part no that replaced the Part no.
Table having data
I want to write code that returns the last part, i.e. the latest part. The loop ends when the part no longer returns anything.
I want to show the data like below:
How data is needed
I tried using a recursive CTE, but the table is huge and the query keeps running for 2 hours.
I am weak at writing stored procedures.
Is there any way we can achieve this? We are okay if it completes in 1 hour.

If we need to analyze levels of nesting, a recursive CTE is a good solution. The key is to choose the starting point correctly: only the roots, so that there are no infinite loops or duplicate results.
If the CTE still takes too long because there is too much data, try scaling up the warehouse or dividing the data into batches.
The CTE should look something like this:
CREATE OR REPLACE TABLE T1 (
    PART_NO STRING,
    COMPONENT_NO STRING);

INSERT INTO T1 (PART_NO, COMPONENT_NO)
VALUES ('9U8806', '1252127'),
       ('1252127', '1073295'),
       ('1073295', '1386464'),
       ('1386464', '2320160'),
       ('2320160', '3153441');

WITH CTE AS (
    -- Anchor: only the roots, i.e. parts that never appear as a replacement.
    SELECT T1.PART_NO AS ORIGINAL_PART_NO,
           T1.PART_NO,
           T1.PART_NO AS PREVIOUS_PART_NO,
           1 AS PART_LEVEL
    FROM T1
    WHERE T1.PART_NO NOT IN (SELECT COMPONENT_NO FROM T1)
    UNION ALL
    -- Recursive step: walk from each part to the part that replaced it.
    SELECT CTE.ORIGINAL_PART_NO,
           T1.COMPONENT_NO AS PART_NO,
           CTE.PART_NO AS PREVIOUS_PART_NO,
           CTE.PART_LEVEL + 1 AS PART_LEVEL
    FROM T1
    JOIN CTE ON CTE.PART_NO = T1.PART_NO
)
SELECT *
FROM CTE;
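If you only need the latest part per chain, which is what the question asks for, you can replace the final SELECT * FROM CTE above with a filter on the deepest level; a minimal sketch using Snowflake's QUALIFY:

SELECT ORIGINAL_PART_NO,
       PART_NO AS LATEST_PART_NO
FROM CTE
-- keep only the deepest (latest) replacement in each chain
QUALIFY ROW_NUMBER() OVER (PARTITION BY ORIGINAL_PART_NO
                           ORDER BY PART_LEVEL DESC) = 1;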

Related

Snowflake merge statement using Golden Gate JSON as source table

While merging into a target table in Snowflake using JSON data as the source table:
merge into cust tgt using (
select parse_json(s.$1):application_num as application num
from prd_json s qualify
row_number() over(partition application
order_by application desc)=1) src
on tgt.application =src.application
when not matched and op_type='I' then
insert(application) values (src.application );
QUALIFY ignores the duplicate data and returns only unique records, but when I put in the join it shows fewer records compared to a plain SELECT statement. For example:
for example :
select distinct application
from prd_json where op_type='I';
-- 15000 rows
With the join, it shows there are no matching records in the target. If nothing matches, it should insert all 15000 rows, but only 8500 rows are inserted even though they are not duplicates. Is there any way to insert the records without using QUALIFY? If I remove QUALIFY I get a DML duplication error. Please guide me if anyone knows.
How about using SELECT DISTINCT?
Your demo SQL does not compile, and your use of $1 makes it hard to guess your column names and thus how the ROW_NUMBER is working, so it's hard to nail down the problem.
But with the following SQL you can replace ROW_NUMBER with DISTINCT:
CREATE TABLE cust(application INT);

CREATE OR REPLACE TABLE prd_json AS
SELECT parse_json(column1) AS application, column2 AS op_type
FROM VALUES
    ('{"application_num":1,"other":1}', 'I'),
    ('{"application_num":1,"other":2}', 'I'),
    ('{"application_num":2,"other":3}', 'I'),
    ('{"application_num":1,"other":1}', 'U');

MERGE INTO cust AS tgt
USING (
    SELECT DISTINCT
        parse_json(s.$1):application_num::int AS application,
        s.op_type
    FROM prd_json AS s
) AS src
ON tgt.application = src.application
WHEN NOT MATCHED AND src.op_type = 'I' THEN
    INSERT (application) VALUES (src.application);
number of rows inserted
2
SELECT * FROM cust;
APPLICATION
1
2
running the MERGE code a second time gives:
number of rows inserted
0
Now if I truncate CUST and swap to using this SQL for the inner part:
SELECT --DISTINCT
    parse_json(s.$1):application_num::int AS application,
    s.op_type
FROM prd_json AS s
QUALIFY ROW_NUMBER() OVER (PARTITION BY application ORDER BY application DESC) = 1
I get three rows inserted, because the PARTITION BY application effectively binds to s.application, not the output application, and there are 3 distinct values of s.application because of the differing "other" values.
The reason I wrote my code this way is that your
select distinct application
from prd_json where op_type='I';
implies there is already something called application in the table, and thus it runs the chance of being bound in the ROW_NUMBER statement.
Anyway, there is a potentially larger problem: if you also have update data (the 'U' rows, I assume) in your transaction block, you will want an ORDER BY in the sub-select so you never apply an Insert,Update pair of actions in Update,Insert order (and presumably you want all the update operations if there are many of them; I'll stop there). But if you do not have updates, the sub-select should filter on op_type = 'I' so the non-insert ops don't make it out, or possibly worse, replace the inserts in your ROW_NUMBER pattern. I suspect that is the underlying cause of your problem.
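A sketch of that last suggestion, assuming there are no update rows you need to merge: filter to insert operations before deduplicating, so a 'U' row can never displace an 'I' row.

MERGE INTO cust AS tgt
USING (
    SELECT DISTINCT
        parse_json(s.$1):application_num::int AS application
    FROM prd_json AS s
    WHERE s.op_type = 'I'   -- keep only insert ops so they cannot be crowded out
) AS src
ON tgt.application = src.application
WHEN NOT MATCHED THEN
    INSERT (application) VALUES (src.application);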

SQL Server - UNION with WHERE clause outside is extremely slow on simple join

I have a simple query and it works fast (<1sec):
;WITH JointIncomingData AS
(
    SELECT A, B, C, D FROM dbo.table1
    UNION ALL
    SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
WHERE A = '1/1/2020'
However, if I join with another small table in the final SELECT statement it is extremely slow (> 30 sec)
DECLARE @anotherTable TABLE (A DATE, B INT)

INSERT INTO @anotherTable (A, B)
VALUES ('1/1/2020', 1)

;WITH JointIncomingData AS
(
    SELECT A, B, C, D FROM dbo.table1
    UNION ALL
    SELECT A, B, C, D FROM dbo.table2
)
SELECT *
FROM JointIncomingData D
JOIN @anotherTable T ON T.A = D.A AND T.B = D.B
In the real application I have a complex UPDATE as the final statement, so to avoid copy-paste I introduced the UNION to consolidate code.
But now I'm experiencing this unexpected slowness.
I tried using UNION ALL instead of UNION, with the same result.
It looks like SQL Server pushes the simple condition down into each branch of the UNION, but when I join with another table this doesn't happen and a table scan occurs.
Any advice?
UPDATE: Here are the estimated plans:
for the first simple condition query: https://www.brentozar.com/pastetheplan/?id=SJ5fynTgP
for the query with a join table: https://www.brentozar.com/pastetheplan/?id=H1eny3pxP
Please keep in mind that the estimated plans are not for exactly the query above but for a more realistic one with exactly the same problem.
When I'm doing complex updates I normally declare a temp table and insert into it the rows I intend to update. There are two benefits to this approach. One is that by explicitly collecting the rows to be updated you simplify the logic and make the update itself really simple: just update the rows whose primary key is in your temp table. The other big benefit is that you can do some sanity checking before actually running the update, and "throw an error" by returning a different value.
I think it's generally good practice to break queries down into simple steps like this, because it makes them much easier to troubleshoot later.
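A sketch of that pattern; the target table and the updated column below are hypothetical, since the question's real schema isn't shown:

-- 1. Explicitly collect the keys of the rows to be updated.
SELECT D.A, D.B
INTO #rowsToUpdate
FROM dbo.table1 AS D
JOIN dbo.targetTable AS T ON T.A = D.A AND T.B = D.B;  -- dbo.targetTable is hypothetical

-- 2. Sanity-check before touching real data.
IF NOT EXISTS (SELECT 1 FROM #rowsToUpdate)
    THROW 50000, N'No rows collected; aborting update.', 1;

-- 3. The update itself stays trivial: only rows keyed in #rowsToUpdate.
UPDATE T
SET T.C = D.C
FROM dbo.targetTable AS T
JOIN #rowsToUpdate AS K ON K.A = T.A AND K.B = T.B
JOIN dbo.table1 AS D ON D.A = T.A AND D.B = T.B;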
Based on the "similar" execution plan you shared (it would be better to have the actual plan, to know whether your estimates and memory grants are OK):
Key lookup
The index IX_dperf_date_fund should be extended to INCLUDE the columns nav and equity.
Why? Every row the index returns causes a key lookup in the clustered index to retrieve the values of nav and equity.
Do this only if it is reasonable for the application, and consider whether other queries may benefit as well; a sketch follows.
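A sketch of the change; the key columns and table name below are assumptions for illustration, since the existing definition of IX_dperf_date_fund isn't shown in the question:

-- Hypothetical key columns and table; keep whatever the existing index uses.
CREATE NONCLUSTERED INDEX IX_dperf_date_fund
    ON dbo.dperf (AsOfDate, FundId)
    INCLUDE (nav, equity)          -- covers the lookup columns
    WITH (DROP_EXISTING = ON);     -- rebuild in place of the old index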
CTE
Change your CTE to a temp table.
Example:
SELECT *
INTO #JointIncomingData
FROM (
    SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
    FROM ETL.tblIncomingData
    UNION ALL
    SELECT AsOfDate, FundId, DataSourceId, ShareClass, NetAssetsBase, SharesOutstanding
    FROM ETL.vIncomingDataDPerf
) x
Why? CTEs are not materialized; SQL Server expands them into the outer query each time they are referenced.
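Once materialized, the slow join runs against the temp table instead; assuming the table variable's A and B columns map to AsOfDate and FundId as in the question's example:

SELECT D.*
FROM #JointIncomingData AS D
JOIN @anotherTable AS T
    ON T.A = D.AsOfDate
   AND T.B = D.FundId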
Bonus: parameter sniffing
If you pass in parameters, you might be suffering from parameter sniffing.

Temp Table with Wild Card

I need to clean up some observations in a table that are inaccurate prior to joining to the aforementioned table; this will avoid duplicate observations in the output.
I validated that taking the max(date_value) removes the 9K inaccurate transactions: newer transactions were completed, which fixed the problem.
The code below works without INTO #temp, but as soon as I add the temp table I get a syntax error and it will not execute. I need about 20 columns out of the table and really don't feel like listing them all; it must be a simple syntax issue, or there is an alternative method.
SELECT * INTO #temp FROM db.dbo.table WHERE MAX(date_value);  -- fails: an aggregate is not allowed in WHERE
SELECT a.* INTO #temp
FROM table a
INNER JOIN (
    SELECT id, MAX(created_at) AS max_created
    FROM db.table
    GROUP BY id
) b ON a.id = b.id
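A minimal sketch of how this is usually written, under two assumptions: a leftover #temp from a prior run is what makes the SELECT ... INTO fail, and the goal is to keep only each id's latest row (note the extra join condition on the date, which the query above is missing):

DROP TABLE IF EXISTS #temp;  -- a pre-existing #temp makes SELECT ... INTO fail

SELECT a.*
INTO #temp
FROM db.dbo.[table] AS a
INNER JOIN (
    SELECT id, MAX(created_at) AS max_created
    FROM db.dbo.[table]
    GROUP BY id
) AS b
    ON a.id = b.id
   AND a.created_at = b.max_created;  -- keep only the latest row per id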

CROSS APPLY with table valued function restriction performance

I have a problem with CROSS APPLY and a parametrised table-valued function.
Here is a simplified pseudo-code example:
SELECT *
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) AS lor
CROSS APPLY dbo.HeavyTableValuedFunction(lor.ID) AS htvf
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
WHERE ...
The inner select on LOT_OF_ROWS_TABLE returns many rows.
Joining LOT_OF_ROWS_TABLE to ANOTHER_TABLE returns only one or a few rows.
The table-valued function is very time-consuming, and when it is called for a lot of rows the select takes a very long time.
My problem: the function is called for all rows returned from LOT_OF_ROWS_TABLE, regardless of the fact that the data will be limited by the join to ANOTHER_TABLE.
The select has to stay in the shown format; it is generated and in reality it is much more difficult.
When I rewrite it like this it can be very fast, but it cannot be rewritten this way:
SELECT *
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) AS lor
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
CROSS APPLY dbo.HeavyTableValuedFunction(at.ID) AS htvf
WHERE ...
I'd like to know:
Is there any setting or hint or something that forces the select to call the function only for the finally restricted rows?
Thank you.
EDIT:
The table valued function is very complex: http://pastebin.com/w6azRvxR.
The select we are talking about is "user configured" and generated: http://pastebin.com/bFbanY2n.
You can divide this query into two parts, using either a table variable or a temp table. First, materialize the restricted rows:
SELECT lor.*, at.*
INTO #tempresult
FROM (
    SELECT lor.*
    FROM LOT_OF_ROWS_TABLE lor
    WHERE ...
) lor
INNER JOIN ANOTHER_TABLE AS at ON lor.ID = at.ID
WHERE ...
Now do the time-consuming part, the table-valued function, against that small result:
SELECT *
FROM #tempresult
CROSS APPLY dbo.HeavyTableValuedFunction(#tempresult.ID) AS htvf
I believe this is what you are looking for.
Plan Forcing Scenario: Create a Plan Guide to Force a Plan Obtained from a Rewritten Query
Basically it describes rewriting the query so that a plan with the correct join order is generated, then saving off that plan and forcing your existing query (which does not get changed) to use the saved plan.
The BOL link above even gives a specific example of rewriting the query with the joins in a different order and a FORCE ORDER hint, then using sp_create_plan_guide to take the plan from the rewritten query and apply it to the original query.
YES and NO... it's hard to interpret what you're trying to achieve without sample data in and results out to compare outcomes.
I'd like to know:
Is there any setting or hint or something that forces the select to call the function only for the finally restricted rows?
So I'll answer your question above (3 years later!!) directly, with a direct statement:
You need to learn about CTEs and the difference between CROSS APPLY and INNER JOIN, and why using CROSS APPLY in your case is necessary. You "could" take the code in your function and fold it into a single SQL statement using a CTE.
ie:
Read this and this.
Essentially, something like this...
WITH t2o AS (
    SELECT t2.*, ROW_NUMBER() OVER (PARTITION BY t1_id ORDER BY rank) AS rn
    FROM t2
)
SELECT t1.*, t2o.*
FROM t1
INNER JOIN t2o
    ON t2o.t1_id = t1.id
   AND t2o.rn <= 3
Extract the data you want ONCE using a CTE, then apply your second SQL using the CROSS APPLY.
You have no choice: you cannot do what you're trying to do in one SQL statement otherwise.

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1
The problem is, it goes really slow (several minutes on a table with about 10k entries). I tried creating separate indexes on Machine_ID and Date_Time, and a single combined index, but nothing helps.
Is there any way to rewrite this query to go faster?
The given ROW_NUMBER() partition and order require an index on (Machine_ID, Date_Time) to satisfy in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on (Machine_ID, Date_Time), and for correct results I'm assuming that combination is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot: a (Machine_ID, Date_Time) index will not generally be covering, so if you have a lot of columns there, or they hold a lot of data, ...
If the number of rows in dbo.DataTable is large, then it is likely that you are experiencing the issue because the CTE self-joins onto itself. There is a blog post explaining the issue in some detail here.
Occasionally in such cases I have resorted to creating a temporary table, inserting the result of the CTE query into it, and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; in the case of a single join the performance difference will be less noticeable).
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculate it each time you pull the data, why not add a column and calculate/populate it whenever row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
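On SQL Server 2012 and newer (which postdates this question), LAG computes the previous timestamp in a single pass over the compound index, with no self-join at all; a minimal sketch:

-- LAG returns NULL for each machine's first recording.
SELECT d.*,
       LAG(d.Date_Time) OVER (PARTITION BY d.Machine_ID
                              ORDER BY d.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS d;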
