Workaround for Snowflake compiler errors

Sometimes the Snowflake SQL compiler tries to be too smart for its own good. This is a follow-up to a previous question here, where a clever solution was provided for my given use-case, but I have run into some limitations with that solution.
A brief background: I have a JS-UDTF that takes 3 float arguments to return rows representing a series, GENERATE_SERIES(FLOAT,FLOAT,FLOAT), and a SQL-UDTF GENERATE_SERIES(INT,INT,INT) that casts the params to floats, invokes the JS-UDTF, and then casts the result back to ints. My original version of this wrapper UDTF was:
CREATE OR REPLACE FUNCTION generate_series(FIRST_VALUE INTEGER, LAST_VALUE INTEGER, STEP_VALUE INTEGER)
RETURNS TABLE (GS_VALUE INTEGER)
AS
$$
SELECT GS_VALUE::INTEGER AS GS_VALUE FROM table(generate_series(FIRST_VALUE::DOUBLE,LAST_VALUE::DOUBLE,STEP_VALUE::DOUBLE))
$$;
This would fail in most cases where the inputs were not constants, e.g.:
WITH report_params AS (
SELECT
1::integer as first_value,
3::integer as last_value,
1::integer AS step_value
)
SELECT
*
FROM
report_params, table(
generate_series(
first_value,
last_value,
step_value
)
)
Would return error:
SQL compilation error: Unsupported subquery type cannot be evaluated
The provided solution to trick the SQL compiler to behave was to encapsulate the function params into a VALUES table and cross-join the inner UDTF:
CREATE OR REPLACE FUNCTION generate_series_int(FIRST_VALUE INTEGER, LAST_VALUE INTEGER, STEP_VALUE INTEGER)
RETURNS TABLE (GS_VALUE INTEGER)
AS
$$
SELECT GS_VALUE::INTEGER AS GS_VALUE
FROM (VALUES (first_value, last_value, step_value)),
table(generate_series(first_value::double,last_value::double,step_value::double))
$$;
This worked nicely for most invocations; however, I've discovered a situation where the SQL compiler is at it again. Here is a simplified example that reproduces the problem:
WITH report_params AS (
SELECT
1::integer AS first_value,
DATEDIFF('DAY','2020-01-01'::date,'2020-02-01'::date)::integer AS last_value,
1::integer AS step_value
)
SELECT
*
FROM
report_params, table(
COMMON.FN.generate_series(
first_value,
last_value,
step_value
)
);
This results in the error:
SQL compilation error: Invalid expression [CORRELATION(SYS_VW.LAST_VALUE_3)] in VALUES clause
The error seems obvious enough (I think): the compiler is trying to embed the function code into the outer query, treating the function like a macro before runtime.
The answer at this point might just be that I am asking too much of Snowflake's current capabilities, but in the interest of learning, and of continuing to build out what I think is a very helpful UDF library, I am curious whether there is a solution I am missing.

The major problem is that you have written a correlated subquery.
WITH report_params AS (
SELECT * FROM VALUES
(1, 30, 1)
v(first_value,last_value, step_value)
)
SELECT
*
FROM
report_params, table(
COMMON.FN.generate_series(
first_value,
last_value,
step_value
)
);
as when you add a second row to your CTE
WITH report_params AS (
SELECT * FROM VALUES
(1, 30, 1),
(2, 40, 2)
v(first_value,last_value, step_value)
)
SELECT
*
FROM
report_params, table(
COMMON.FN.generate_series(
first_value,
last_value,
step_value
)
);
it becomes more obvious that this is correlated, and it is not so obvious how Snowflake should execute it.
For the above data, the ideal query would look like this (if it were valid SQL):
WITH report_params AS (
SELECT *
,mod(v.first_value,v.step_value) as mod_offset
FROM VALUES
(0, 5, 20, 1),
(1, 3, 15, 3),
(2, 4, 15, 3),
(3, 5, 15, 3)
v(id, first_value,last_value, step_value)
), report_ranges AS (
SELECT min(first_value) as mmin,
max(last_value) as mmax
FROM report_params
WHERE first_value <= last_value AND step_value > 0
), all_range AS (
SELECT
row_number() over (order by seq8()) + rr.mmin - 1 as seq
FROM report_ranges rr,
TABLE(GENERATOR( ROWCOUNT => (rr.mmax - rr.mmin) + 1 ))
)
SELECT
ar.seq
,rp.id, rp.first_value, rp.last_value, rp.step_value, rp.mod_offset
FROM all_range as ar
JOIN report_params as rp ON ar.seq BETWEEN rp.first_value AND rp.last_value AND mod(ar.seq, rp.step_value) = rp.mod_offset
ORDER BY 2,1;
But if you are generating it in a stored procedure (or externally), the min/max values could be substituted in:
WITH report_params AS (
SELECT *
,mod(v.first_value,v.step_value) as mod_offset
FROM VALUES
(0, 5, 20, 1),
(1, 3, 15, 3),
(2, 4, 15, 3),
(3, 5, 15, 3)
v(id, first_value,last_value, step_value)
), all_range AS (
SELECT
row_number() over (order by seq8()) + 3 /*min*/ - 1 as seq
FROM TABLE(GENERATOR( ROWCOUNT => (20/*max*/ - 3/*min*/) + 1 ))
)
SELECT
ar.seq
,rp.id
,rp.first_value, rp.last_value, rp.step_value, rp.mod_offset
FROM all_range as ar
JOIN report_params as rp ON ar.seq BETWEEN rp.first_value AND rp.last_value AND mod(ar.seq, rp.step_value) = rp.mod_offset
ORDER BY 2,1;
giving:
SEQ ID FIRST_VALUE LAST_VALUE STEP_VALUE MOD_OFFSET
5 0 5 20 1 0
6 0 5 20 1 0
7 0 5 20 1 0
8 0 5 20 1 0
9 0 5 20 1 0
10 0 5 20 1 0
11 0 5 20 1 0
12 0 5 20 1 0
13 0 5 20 1 0
14 0 5 20 1 0
15 0 5 20 1 0
16 0 5 20 1 0
17 0 5 20 1 0
18 0 5 20 1 0
19 0 5 20 1 0
20 0 5 20 1 0
3 1 3 15 3 0
6 1 3 15 3 0
9 1 3 15 3 0
12 1 3 15 3 0
15 1 3 15 3 0
4 2 4 15 3 1
7 2 4 15 3 1
10 2 4 15 3 1
13 2 4 15 3 1
5 3 5 15 3 2
8 3 5 15 3 2
11 3 5 15 3 2
14 3 5 15 3 2
What I cannot guess at is whether you are trying to hide some complexity behind the table/JS functions, or have made things overly complex for an unstated reason.
[edit, speaking to the 1-9 comment]
The major difference between a generate_series and GENERATOR is that the former acts almost like a UDF or CTE, while in Snowflake you have to keep the GENERATOR in its own sub-select or you will get messed-up results.
with s1 as (
SELECT
row_number() over (order by seq8()) -1 as seq
FROM
TABLE(GENERATOR( ROWCOUNT => 3 ))
), s2 as (
SELECT
row_number() over (order by seq8()) -1 as seq
FROM
TABLE(GENERATOR( ROWCOUNT => 3 ))
)
select s1.seq as a, s2.seq as b
from s1, s2
order by 1,2;
gives 9 rows with the two sequences crossed, as you would want.
Whereas
with s1 as (
SELECT
row_number() over (order by seq8()) -1 as seq
FROM
TABLE(GENERATOR( ROWCOUNT => 3 ))
)
SELECT
row_number() over (order by seq8()) -1 as a
,s1.seq as b
FROM
TABLE(GENERATOR( ROWCOUNT => 3 )), s1;
gives 1-9, because the GENERATOR (the creator of rows) has been cross-joined with the other data before the row-numbering code has run.
Another version of the originally provided solution is:
WITH report_params AS (
SELECT *
,trunc(div0((last_value-first_value),step_value)) as steps
FROM VALUES
(0, 5, 20, 1),
(1, 3, 15, 3),
(2, 4, 15, 3),
(3, 5, 15, 3)
v(id, first_value,last_value, step_value)
), large_range AS (
SELECT
row_number() over (order by seq8()) -1 as seq
FROM
TABLE(GENERATOR( ROWCOUNT => 1000 ))
)
select rp.id
,rp.first_value + (lr.seq*rp.step_value) as val
from report_params as rp
join large_range as lr on lr.seq <= rp.steps
order by 1,2;
which I like more, as the nature of the mixing is clearer. But it still speaks to the mindset difference between Snowflake and other relational databases. In Postgres there is no cost to doing per-row operations, because it was born of an era when everything was a per-row operation; Snowflake has no per-row option, and because it cannot do things on each row, it can do many rows independently. That means all per-row expressions need to be moved to the front and then joined, which is what the above is trying to show.
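To make the set-based reshaping concrete, here is a small Python model of what the "ideal" query above computes: one shared range built from the global min/max, then a per-row filter (BETWEEN plus the mod offset) instead of per-row generation. This is an illustration of the logic, not Snowflake code.

```python
# Model of the set-based generate_series expansion.
report_params = [
    # (id, first_value, last_value, step_value)
    (0, 5, 20, 1),
    (1, 3, 15, 3),
    (2, 4, 15, 3),
    (3, 5, 15, 3),
]

# mod_offset, as computed in the report_params CTE
params = [(i, f, l, s, f % s) for (i, f, l, s) in report_params]

# report_ranges / all_range: one sequence covering every row's span
mmin = min(f for (_, f, l, s, _) in params if f <= l and s > 0)
mmax = max(l for (_, f, l, s, _) in params if f <= l and s > 0)
all_range = range(mmin, mmax + 1)

# final join: seq BETWEEN first/last AND mod(seq, step) = mod_offset
rows = [
    (seq, i, f, l, s, m)
    for (i, f, l, s, m) in params
    for seq in all_range
    if f <= seq <= l and seq % s == m
]
rows.sort(key=lambda r: (r[1], r[0]))  # ORDER BY id, seq
```

Each params row selects its own arithmetic subsequence out of the one shared range, which is exactly the per-row-moved-to-the-front pattern described above.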

Related

SQL Server script not working as expected

I have this little script that shall return the first number in a column of type int which is not used yet.
SELECT t1.plu + 1 AS plu
FROM tovary t1
WHERE NOT EXISTS (SELECT 1 FROM tovary t2 WHERE t2.plu = t1.plu + 1)
AND t1.plu > 0;
this returns the unused numbers like
3
11
22
27
...
The problem is, that when I make a simple select like
SELECT plu
FROM tovary
WHERE plu > 0
ORDER BY plu ASC;
the results are
1
2
10
20
...
Why isn't the first script returning some of the free numbers, like 4, 5, 6 and so on?
Compiling a formal answer from the comments.
Credit to Larnu:
It seems what the OP really needs here is an (inline) Numbers/Tally table, which they can then use with a NOT EXISTS against their table.
Sample data
create table tovary
(
plu int
);
insert into tovary (plu) values
(1),
(2),
(10),
(20);
Solution
The tally table is isolated in a common table expression First1000 that produces the numbers 1 to 1000. The amount of generated numbers can be scaled up as needed.
with First1000(n) as
(
select row_number() over(order by (select null))
from ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) a(n) -- 10^1
cross join ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) b(n) -- 10^2
cross join ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) c(n) -- 10^3
)
select top 20 f.n as Missing
from First1000 f
where not exists ( select 'x'
from tovary
where plu = f.n);
top 20 is used in the query above to limit the output. This gives:
Missing
-------
3
4
5
6
7
8
9
11
12
13
14
15
16
17
18
19
21
22
23
24
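The tally-table approach amounts to generating a candidate range and keeping only the values not present in the table. A Python model of the same NOT EXISTS logic (illustrative, not T-SQL):

```python
# Find unused numbers: generate candidates 1..1000, drop those already used.
plu = {1, 2, 10, 20}          # sample data from the tovary table
first1000 = range(1, 1001)    # the First1000 tally CTE

missing = [n for n in first1000 if n not in plu]  # NOT EXISTS
top20 = missing[:20]                              # TOP 20
```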

Calculating the stddev and avg between the most recent number and all the other numbers in a running list snowflake

I have a dataset that looks something like this:
id committed delivered timestamp stddev
1 10 8 01-02-2022 ?
2 20 15 01-14-2022 ?
3 12 12 01-30-2022 ?
4 2 0 02-14-2022 ?
.
.
99 null
I am trying to calculate the standard deviation between sprint x and all the sprints after sprint x; for example, the standard deviation and avg of sprints 1, 2, 3 & 4; then of 2, 3 & 4; then of 3 & 4; and so on. If there are no records after 4, that stddev would be null.
With the current Snowflake functions I am unable to calculate this stddev at all, let alone do something with a lag/lead function.
Does anyone have any advice? Thanks in advance!
Update:
I've figured out how to calculate a moving avg over sprint x and the next sprint, but not for all previous sprints:
(delivered + lead(delivered) over (partition by id order by timestamp asc)) / 2
stddev can also be calculated using abs / sqrt (2)
You're looking for a frame clause -- this is part of the window function that can specify which rows in the current partition to use in the calculation.
select
id,
stddev(delivered) over (
order by id asc
rows between current row and unbounded following
) as stddev,
avg(delivered) over (
order by id asc
rows between current row and unbounded following
) as avg
from my_data
tconbeer is 100% correct, but here is the code and a count to "show it working", with your example data (moshed) into a VALUES section to avoid making a table.
I also stripped out timestamp, as it's not used in this demo; normally I would order by that, but I could not see the pattern, so I just dropped it, as it's immaterial to the example.
SELECT t.*
,count(delivered) over ( order by id asc rows between current row and unbounded following ) as _count
,stddev(delivered) over ( order by id asc rows between current row and unbounded following ) as stddev
,avg(delivered) over ( order by id asc rows between current row and unbounded following ) as avg
FROM VALUES
(1, 10, 8),
(2, 20, 15),
(3, 12, 12),
(4, 2, 0),
(5, 2, 0),
(6, 0, 0)
t(id, committed, delivered)
ORDER BY 1;
gives:
ID  COMMITTED  DELIVERED  _COUNT  STDDEV       AVG
1   10         8          6       6.765106577  5.833
2   20         15         5       7.469939759  5.4
3   12         12         4       6            3
4   2          0          3       0            0
5   2          0          2       0            0
6   0          0          1       null         0
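The frame `ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING` aggregates each row's suffix of the table. The same numbers can be reproduced in Python (sample standard deviation, which is what Snowflake's STDDEV computes); this is a verification sketch, not Snowflake code:

```python
import statistics

delivered = [8, 15, 12, 0, 0, 0]  # DELIVERED by id, from the VALUES above

results = []
for i in range(len(delivered)):
    suffix = delivered[i:]  # "current row and unbounded following"
    cnt = len(suffix)
    # sample stddev; a single-row frame yields null, as in the last output row
    sd = statistics.stdev(suffix) if cnt > 1 else None
    avg = statistics.mean(suffix)
    results.append((cnt, sd, avg))
```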
You can create a dummy table with an id generated sequentially using the generator function for a particular range and do a LEFT JOIN with your table. This way you will get rows with NULL values where the id is not present, and then you can use lag/lead to get the average.
--- untested
select seq4() as id1 , TMP.* from table(generator(rowcount => 10)) v
LEFT JOIN (SELECT * FROM (
SELECT 1 as id , 8 As committed , '01-02-2022' as timestamp UNION ALL
SELECT 2 as id , 20 As committed , '01-14-2022' as timestamp UNION ALL
SELECT 3 as id , 12 As committed , '01-30-2022' as timestamp UNION ALL
SELECT 5 as id , 2 As committed , '02-14-2022' as timestamp
)) TMP
ON trim(id1) = trim(tmp.id)

regarding updating rows in table without using loop

I have a table with a column called quantity, and 10 rows that have the same column value, 200 (it can be any value).
The requirement is: if an input value is x=500 (or any number), then this value should be compared with the quantity column value as follows:
if the 1st row's quantity is 200, it should be subtracted from 500, x should be updated to 300, and the quantity of that row should be set to 0; then it should move to the next row, until x is 0.
Could you please help me write a SQL query for this?
It is asked that loops not be used.
thanks,
What is the version of SQL Server? If it's 2012 or 2014, you can use the following:
DECLARE @x int = 500
;WITH cte_sum AS
(
SELECT quantity,
ISNULL(SUM(quantity) OVER (ORDER BY (SELECT NULL) ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING), 0) sum_running_before,
SUM(quantity) OVER (ORDER BY (SELECT NULL) ROWS UNBOUNDED PRECEDING) sum_running_total
FROM YourTable
)
UPDATE cte_sum
SET quantity = CASE
WHEN quantity >= @x - sum_running_before THEN
quantity - (@x - sum_running_before)
ELSE 0
END
WHERE (@x >= sum_running_total OR (@x < sum_running_total AND sum_running_before < @x))
It's a bit more tricky to get running totals in earlier versions but I think you got the main idea.
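The trick in the UPDATE above is that each row's new quantity depends only on the running total of the rows before it, so no loop over x is needed. A Python model of that CASE logic (illustrative only; names are mine, not from the question):

```python
from itertools import accumulate

def deplete(quantities, x):
    # sum_running_before per row (the windowed SUM ... 1 PRECEDING)
    before = [0] + list(accumulate(quantities))[:-1]
    # the CASE expression from the UPDATE, applied independently per row
    return [
        q - (x - b) if b < x and q >= x - b  # row partially consumed
        else 0 if b < x                      # row fully consumed
        else q                               # x already exhausted; row unchanged
        for q, b in zip(quantities, before)
    ]
```

For the question's data (ten rows of 200, x=500), the first two rows go to 0, the third to 100, and the rest are untouched.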
DECLARE @YourTable TABLE
(
CustId INT,
Quantity INT
)
INSERT INTO @YourTable
( CustId, Quantity )
VALUES
( 1, 10 ),
( 1, 10 ),
( 1, 10 ),
( 1, 10 ),
( 2, 20 ),
( 2, 20 ),
( 2, 20 ),
( 2, 20 );
;WITH cte_sum AS
(
SELECT
y.CustId,
y.Quantity,
ROW_NUMBER() OVER (PARTITION BY CustId ORDER BY Quantity) RN
FROM @YourTable y
)
SELECT s1.CustId, s1.Quantity, s2.Qty, s1.Quantity + ISNULL(s2.Qty, 0) RunningTotal, s1.RN
FROM cte_sum s1
OUTER APPLY
(
SELECT SUM(Quantity) Qty FROM cte_sum s2
WHERE s2.CustId = s1.CustId
AND s2.RN < s1.RN
) s2
ORDER BY s1.CustId, s1.RN
Here's an example of a running total that will work for Sql Server 2005+
This is the output:
CustId Quantity Qty RunningTotal RN
1 10 NULL 10 1
1 10 10 20 2
1 10 20 30 3
1 10 30 40 4
2 20 NULL 20 1
2 20 20 40 2
2 20 40 60 3
2 20 60 80 4

Split a string without delimiter in TSQl

I have a result set from a SELECT statement; how can I split one column without any delimiter?
this is my result
Size TCount TDevice
2 5 E01
4.5 3 W02E01
I want to have this
Size TCount TDevice
2 5 E
2 5 0
2 5 1
4.5 3 W
4.5 6 0 (we have 0 two times)
4.5 3 2
4.5 3 1
thank you
You can join onto an auxiliary numbers table. I am using spt_values for demo purposes but you should create a permanent one.
WITH Nums
AS (SELECT number
FROM master..spt_values
WHERE type = 'P'
AND number BETWEEN 1 AND 1000),
Result(Size, TCount, TDevice)
AS (SELECT 2, 5,'E01'
UNION ALL
SELECT 4.5,3,'W02E01')
SELECT Size,
COUNT(*) * TCount AS TCount,
SUBSTRING(TDevice, number, 1) AS TDevice
FROM Result
JOIN Nums
ON Nums.number <= LEN(TDevice)
GROUP BY Size,
TCount,
SUBSTRING(TDevice, number, 1)
;with cte as
(
select Size,TCount,
substring(TDevice, 1, 1) as Chars,
stuff(TDevice, 1, 1, '') as TDevice
from t1
union all
select Size,TCount,
substring(TDevice, 1, 1) as Chars,
stuff(TDevice, 1, 1, '') as TDevice
from cte
where len(TDevice) > 0
)
select distinct Size,sum(TCount),Chars
from cte
group by Size,Chars
SQL Fiddle
Advantage: It doesn't require any User defined function (UDF) to be created.
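Both answers implement the same underlying logic: explode each TDevice into its characters, then group by character and scale TCount by the occurrence count. A Python model of that logic (illustrative, not T-SQL):

```python
from collections import Counter

rows = [(2, 5, 'E01'), (4.5, 3, 'W02E01')]  # (Size, TCount, TDevice)

result = []
for size, tcount, tdevice in rows:
    counts = Counter(tdevice)                  # group by character
    for ch, n in counts.items():
        result.append((size, tcount * n, ch))  # TCount scaled by occurrences
```

Note that, like the SQL answers, this also emits a row for the 'E' in 'W02E01', which the expected output in the question omits.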

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the first occurrence of the date. The exception here is that I need to group them only if they are not interrupted by a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with - probably could work with that. This is a query that should locate when a value is changed.
> SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
> d1.Value <> d2.Value
It could probably be done with good use of the row_number clause, but I can't manage it.
Sample data:
declare #T table (ID int, Value int, Timestamp date)
insert into #T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from #T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second query simply finds rows where there isn't an immediate preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
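This pattern is often called "gaps and islands": collapse consecutive runs of the same Value, keeping each run's first row. For intuition, the same reduction in Python (illustrative, not T-SQL):

```python
from itertools import groupby

rows = [
    (1, 0, '2001-1-1'), (2, 0, '2002-1-1'),
    (3, 1, '2003-1-1'), (4, 1, '2004-1-1'),
    (5, 0, '2005-1-1'),
    (6, 2, '2006-1-1'), (7, 2, '2007-1-1'), (8, 2, '2008-1-1'),
]  # (ID, Value, Timestamp), already ordered by Timestamp

# group consecutive rows by Value; keep the first row of each run
firsts = [next(run) for _, run in groupby(rows, key=lambda r: r[1])]
```

Note that the two runs of Value 0 stay separate because they are interrupted by the run of 1s, which is exactly the "exception" the question describes.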
