Nested window function not working in snowflake - snowflake-cloud-data-platform

I am migrating Spark SQL queries to SnowSQL.
In one query I use nested window functions in Spark SQL, and I want to migrate that query to Snowflake, but Snowflake does not support nested window functions.
Spark sql query -
SELECT
*,
(case when (
(
lead(timestamp -lag(timestamp)
over (partition by session_id order by timestamp))
over (partition by session_id order by timestamp)
) is not null)
then
(
lead(timestamp -lag(timestamp)
over (partition by session_id order by timestamp))
over (partition by session_id order by timestamp)
)
else 0 end)/1000 as pg_to_pg
FROM dwell_time_step2
I tried to convert the above query to Snowflake as follows.
Converted Snowsql -
with lagsession as (
SELECT
a.*,
lag(timestamp) over (partition BY session_id order by timestamp asc) lagsession
FROM mktg_web_wi.dwell_time_step2 a
)
select
a.*,
nvl(lead(a.timestamp - b.lagsession) over (partition BY a.session_id order by a.timestamp),0)/1000 pg_to_pg
FROM mktg_web_wi.dwell_time_step2 a,
lagsession b
WHERE a.key=b.key
order by timestamp;
The problem is in the SnowSQL output: the dwell time values are assigned to the wrong URLs.
I expect the Spark SQL query to work in SnowSQL and produce the same output in both cases.
Please let me know if anybody knows how this can be solved.
Thanks!

I think that changing this from a nested window function to a CTE has changed which records the lag and lead refer to, but this is tricky to get my head around.
Regardless, if I'm understanding your code correctly, there is a much simpler approach that uses only one window function.
select
a.*,
nvl((lead(a.timestamp) over (partition BY a.session_id order by a.timestamp) - a.timestamp)/1000, 0) pg_to_pg
FROM mktg_web_wi.dwell_time_step2 a
order by timestamp;
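To see why a single LEAD is enough, here is a minimal sketch that runs both forms against SQLite through Python's sqlite3 module (SQLite 3.25 or later is assumed for window function support). It is a stand-in for Snowflake, not Snowflake itself; the sample rows and the column name ts (used instead of timestamp) are made up for illustration. The first query splits the nested window function into a CTE plus an outer LEAD, the second uses the single LEAD minus the current row, and both should return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dwell_time_step2 (session_id TEXT, url TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO dwell_time_step2 VALUES (?, ?, ?)",
    [("s1", "/home", 1000), ("s1", "/cart", 4000),
     ("s1", "/pay", 9000), ("s2", "/home", 2000)],  # hypothetical sample data
)

# Nested window function split in two steps: LAG in a CTE, LEAD outside.
NESTED_AS_CTE = """
WITH lagged AS (
    SELECT session_id, url, ts,
           ts - LAG(ts) OVER (PARTITION BY session_id ORDER BY ts) AS gap
    FROM dwell_time_step2
)
SELECT session_id, url,
       COALESCE(LEAD(gap) OVER (PARTITION BY session_id ORDER BY ts), 0) / 1000.0 AS pg_to_pg
FROM lagged
ORDER BY session_id, ts
"""

# Single window function: next row's timestamp minus the current one.
SINGLE_LEAD = """
SELECT session_id, url,
       COALESCE(LEAD(ts) OVER (PARTITION BY session_id ORDER BY ts) - ts, 0) / 1000.0 AS pg_to_pg
FROM dwell_time_step2
ORDER BY session_id, ts
"""

nested_rows = conn.execute(NESTED_AS_CTE).fetchall()
single_rows = conn.execute(SINGLE_LEAD).fetchall()
for row in single_rows:
    print(row)
```

The next row's "timestamp minus previous timestamp" is just the next timestamp minus the current one, which is why the two queries agree.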

Related

How to calculate the days between 2 dates from 2 successive IDs

I need some help to create a new column in a database in SQL Server 2008.
I have the following data table
Please have a look at a snapshot of my table
In the blank column I would like to put the difference between the current status date and the next status's date. For the last ID_Status of each ID_Ticket, I would like the difference between the current date and its date.
I hope that gives you an idea of my problem.
Please share any ideas about how to do this.
Many thanks
kind regards
You didn't specify your RDBMS, so I'll post an answer for both since they are almost identical:
SQL Server:
SELECT ss.id_ticket,ss.id_status,ss.date_status,
DATEDIFF(day,ss.date_status,COALESCE(ss.next_date,GETDATE())) as diffStatus
FROM (
SELECT t.*,
(SELECT TOP 1 s.date_status FROM YourTable s
WHERE t.id_ticket = s.id_ticket and s.date_status > t.date_status
ORDER BY s.date_status ASC) as next_date
FROM YourTable t) ss
MySQL:
SELECT ss.id_ticket,ss.id_status,ss.date_status,
-- MySQL DATEDIFF(a, b) returns a - b in days, so the later date goes first
DATEDIFF(COALESCE(ss.next_date,NOW()),ss.date_status) as diffStatus
FROM (
SELECT t.*,
(SELECT s.date_status FROM YourTable s
WHERE t.id_ticket = s.id_ticket and s.date_status > t.date_status
ORDER BY s.date_status ASC limit 1) as next_date
FROM YourTable t) ss
This first uses a correlated subquery to fetch the next date using LIMIT/TOP, and then wraps it in another SELECT to calculate the difference between the two dates using DATEDIFF().
It can be done without the wrapping query, but it would be less readable because the correlated subquery would sit inside the DATEDIFF() call, so I prefer this way.
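The correlated-subquery pattern described above can be tried out with a small sketch in Python's sqlite3 module. This is an illustration only: SQLite has no DATEDIFF, so julianday() differences are used instead, the sample tickets are invented, and a fixed AS_OF date stands in for GETDATE()/NOW() so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (id_ticket INTEGER, id_status INTEGER, date_status TEXT)")
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?)",
    [(1, 1, "2016-01-01"), (1, 2, "2016-01-05"), (1, 3, "2016-01-12"),
     (2, 1, "2016-02-01")],  # hypothetical sample data
)

AS_OF = "2016-02-11"  # assumption: fixed stand-in for the current date

QUERY = """
SELECT ss.id_ticket, ss.id_status,
       CAST(julianday(COALESCE(ss.next_date, :as_of))
            - julianday(ss.date_status) AS INTEGER) AS diffStatus
FROM (
    SELECT t.*,
           -- correlated subquery: earliest later status date of the same ticket
           (SELECT s.date_status FROM YourTable s
            WHERE s.id_ticket = t.id_ticket AND s.date_status > t.date_status
            ORDER BY s.date_status ASC LIMIT 1) AS next_date
    FROM YourTable t
) ss
ORDER BY ss.id_ticket, ss.date_status
"""

rows = conn.execute(QUERY, {"as_of": AS_OF}).fetchall()
for row in rows:
    print(row)
```

The last status of each ticket has no later row, so next_date is NULL and COALESCE falls back to the as-of date.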

Replace Group By clause with any other clause

In the query below, I use a GROUP BY clause to get the most recently updated records based on the updated date. However, I would like to write the query without GROUP BY for internal reasons. Can anyone please help me solve this?
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT MAX(Proj_UpdatedDate)
FROM ProjectProgress
GROUP BY
Proj_ProjectID)
ORDER BY
Proj_ProjectID
Using TOP 1 should give you the same result assuming you meant the MAX(Proj_UpdatedDate):
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT TOP 1 Proj_UpdatedDate
FROM ProjectProgress
ORDER BY Proj_UpdatedDate DESC)
ORDER BY
Proj_ProjectID
However your query actually returns multiple dates since it's GROUPED BY Proj_ProjectId (the max date for each project). Is that your desired outcome - to show a list of dates that the projects were updated and by whom?
If so, try using ROW_NUMBER():
SELECT Proj_UpdatedDate, Proj_UpdatedBy
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Proj_ProjectID ORDER BY Proj_UpdatedDate DESC) rn,
Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress
) t
WHERE rn = 1
And here is the SQL Fiddle. This assumes you are running SQL Server 2005 or greater.
Good luck.
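As a rough check of the ROW_NUMBER() idea, the sketch below runs the same pattern on SQLite via Python's sqlite3 module (SQLite 3.25+ assumed; the sample rows are invented). The window is ordered by Proj_UpdatedDate DESC so that rn = 1 marks the most recent row of each project, with no GROUP BY involved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ProjectProgress "
    "(Proj_ProjectID INTEGER, Proj_UpdatedDate TEXT, Proj_UpdatedBy TEXT)"
)
conn.executemany(
    "INSERT INTO ProjectProgress VALUES (?, ?, ?)",
    [(1, "2013-01-01", "alice"), (1, "2013-03-01", "bob"),
     (2, "2013-02-01", "amy"), (2, "2013-01-15", "carol")],  # hypothetical data
)

QUERY = """
SELECT Proj_ProjectID, Proj_UpdatedDate, Proj_UpdatedBy
FROM (
    SELECT ROW_NUMBER() OVER (PARTITION BY Proj_ProjectID
                              ORDER BY Proj_UpdatedDate DESC) AS rn,
           Proj_ProjectID, Proj_UpdatedDate, Proj_UpdatedBy
    FROM ProjectProgress
) t
WHERE rn = 1
ORDER BY Proj_ProjectID
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```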

rank() over in hive

I'm converting a SQL Server stored procedure to HiveQL.
How can I convert something like:
SELECT
p.FirstName, p.LastName,
RANK() OVER (ORDER BY a.PostalCode) AS Rank
I have seen this use case a few times. There is a way to do something similar to RANK() in Hive using a UDF.
There are basically a few steps:
Partition the data into groups with DISTRIBUTE BY
Order the data in each group with SORT BY
Apply the rank UDF to the sorted rows, as in the query below
There is actually a nice article on the topic, and you can also find some code from Edward Capriolo here.
Here is an example query doing a rank in Hive:
ADD JAR p-rank-demo.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'demo.PsuedoRank';
SELECT
category,country,product,sales,rank
FROM (
SELECT
category,country,product,sales,
p_rank(category, country) rank
FROM (
SELECT
category,country,product,
sales
FROM p_rank_demo
DISTRIBUTE BY
category,country
SORT BY
category,country,sales desc) t1) t2
WHERE rank <= 3
Which does the equivalent of the following query in MySQL:
SELECT
category,country,product,sales,rank
FROM (
SELECT
category,country,product, sales,
rank() over (PARTITION BY category, country ORDER BY sales DESC) rank
FROM p_rank_demo) t
WHERE rank <= 3
For anyone who comes across this question, Hive now supports rank() and other analytics functions.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
No big deal
Somehow in Hive 0.13.1 that comma does not work :-(
so I got it working without the comma
(PARTITION BY category country ORDER BY sales DESC)
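For readers who want to experiment with the top-N-per-group pattern without a Hive cluster, here is a small sketch using SQLite through Python's sqlite3 module (SQLite 3.25+ assumed). It only illustrates the native RANK() form, not the UDF approach; the sample rows are made up, and the alias rnk is used instead of rank purely to avoid clashing with the function name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE p_rank_demo (category TEXT, country TEXT, product TEXT, sales INTEGER)"
)
conn.executemany(
    "INSERT INTO p_rank_demo VALUES (?, ?, ?, ?)",
    [("toys", "US", "a", 100), ("toys", "US", "b", 90),
     ("toys", "US", "c", 90), ("toys", "US", "d", 50),
     ("books", "UK", "x", 10)],  # hypothetical sample data
)

QUERY = """
SELECT category, country, product, sales, rnk
FROM (
    SELECT category, country, product, sales,
           RANK() OVER (PARTITION BY category, country ORDER BY sales DESC) AS rnk
    FROM p_rank_demo
) t
WHERE rnk <= 2
ORDER BY category, country, rnk, product
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Note that RANK() gives tied rows the same rank, so a "top 2" filter can return more than two rows per group, as it does here for products b and c.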

How to replicate mysql range and limit in SQL Server using the top clause

I've been trying to replicate the limit and range feature provided in MySQL in SQL Server, with no luck so far. I have found several guides and now think my SQL is nearly correct, but I'm still getting the error posted below.
System.Data.SqlClient.SqlException:
Only one expression can be specified
in the select list when the subquery
is not introduced with EXISTS
The error message says to use EXISTS, but I have tried that instead of NOT IN and I still get an error.
My sql is posted below
SELECT TOP (@range) *
FROM client
WHERE clientId NOT IN
(SELECT TOP (@limit) *
FROM client
ORDER BY clientId)
ORDER BY clientId
The change you need to make to your code is
SELECT TOP (@range) *
FROM client
WHERE clientId NOT IN (SELECT TOP (@limit) clientId /*<-- NOT "*" here */
FROM client
ORDER BY clientId)
ORDER BY clientId
This can also be done by using row_number as below (which performs better depends on the different indexes available and how wide a covering index on the whole query is compared to a narrow one on just clientId.)
DECLARE @lowerlimit int
SET @lowerlimit = @range + @limit;
WITH cte As
(
SELECT TOP (@lowerlimit) * , ROW_NUMBER() OVER (ORDER BY clientId) AS RN
FROM client
ORDER BY clientId
)
SELECT * /*TODO: Your Actual column list*/
FROM cte
WHERE RN > @limit /* skip the first @limit rows, matching the NOT IN version */
Another (similar, slower :) ) way
SELECT * FROM (
select rank() over (ORDER BY yourorder) as rank, X.*
from TableX X
) x2 WHERE rank between 5 and 10
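The row-numbering approach to paging can be sketched as follows, using SQLite through Python's sqlite3 module as a stand-in for SQL Server (SQLite 3.25+ assumed). The helper name fetch_page, its skip/take parameters, and the sample rows are hypothetical; skip plays the role of the MySQL offset and take the number of rows to return.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE client (clientId INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO client VALUES (?, ?)",
    [(i, f"client {i}") for i in range(1, 11)],  # hypothetical sample data
)

PAGE_QUERY = """
SELECT clientId
FROM (
    SELECT clientId, ROW_NUMBER() OVER (ORDER BY clientId) AS rn
    FROM client
) numbered
WHERE rn > :skip AND rn <= :skip + :take
ORDER BY rn
"""

def fetch_page(connection, skip, take):
    """Return the clientIds of `take` rows after skipping the first `skip` rows."""
    cursor = connection.execute(PAGE_QUERY, {"skip": skip, "take": take})
    return [row[0] for row in cursor.fetchall()]

print(fetch_page(conn, 3, 4))
```

The strict rn > :skip comparison is what makes the page start right after the skipped rows; an inclusive comparison would return one extra row.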

SQL Server count(*) issue

I am using SQL Server 2008 Enterprise. I am using the following statement in SQL Server Management Studio as part of a stored procedure, and I get the error below (a compile error when I press F5 to run the stored procedure). When I remove count(*), all errors disappear. Any ideas what is wrong with count(*)? Another question: my purpose is to return the total number of matched results but only return part of the result set to implement paging (using tt.rowNum between @startPos and @requireCount + @startPos-1). Any ideas how to implement that?
SELECT *
FROM (SELECT count(*), t.id,t.AdditionalInfo, ROW_NUMBER()
OVER (order by t.id) AS rowNum
FROM dbo.foo t
CROSS APPLY t.AdditionalInfo.nodes('/AdditionalInfo')
AS MyTestXMLQuery(AdditionalInfo)
WHERE
(Tag4=''+@InputTag4+'' OR Tag5=''+@InputTag5+'')
and (MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag1"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag2"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag3"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%') ) tt
WHERE tt.rowNum between @startPos and @requireCount + @startPos-1
Error message,
Column 'dbo.foo.ID' is invalid in the select list
because it is not contained in either an aggregate function
or the GROUP BY clause.
No column name was specified for column 1 of 'tt'.
thanks in advance,
George
Replace it with
SELECT count(*) over() AS [Count]
It needs an alias as it is a column in a derived table.
The empty over() clause will return the count in the whole derived table. Is that what you need?
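A minimal sketch of how COUNT(*) OVER () combines with ROW_NUMBER() for paging, run on SQLite through Python's sqlite3 module (SQLite 3.25+ assumed). The table layout, the tag filter, and the sample rows are simplified assumptions; the XML handling from the question is left out. Every returned row carries the total number of matches, while the outer filter keeps only the requested page.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, tag TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(i, "x" if i % 2 == 0 else "y") for i in range(1, 11)],  # hypothetical data
)

QUERY = """
SELECT id, total, rowNum
FROM (
    SELECT id,
           COUNT(*) OVER () AS total,               -- count of all matching rows
           ROW_NUMBER() OVER (ORDER BY id) AS rowNum
    FROM foo
    WHERE tag = :tag
) tt
WHERE rowNum BETWEEN :start_pos AND :start_pos + :require_count - 1
ORDER BY rowNum
"""

rows = conn.execute(
    QUERY, {"tag": "x", "start_pos": 2, "require_count": 2}
).fetchall()
for row in rows:
    print(row)
```

Because the window functions are evaluated in the derived table before the outer WHERE is applied, total reflects all matches rather than just the page.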
You generally can't mix aggregate functions and normal field selections without a GROUP BY clause.
In queries where you are only selecting a COUNT(*) it assumes you mean to lump everything together in one group. Once you select another field (without a corresponding GROUP BY), you introduce a contradiction to that assumption and it will not execute.
You need to have a GROUP BY clause. Try this:
SELECT *
FROM (SELECT
count(*) AS c, t.id,t.AdditionalInfo
FROM
dbo.foo t
CROSS APPLY
t.AdditionalInfo.nodes('/AdditionalInfo') AS MyTestXMLQuery(AdditionalInfo)
WHERE
(Tag4=''+@InputTag4+'' OR Tag5=''+@InputTag5+'')
and (MyTestXMLQuery.AdditionalInfo.value('(Item[@Name="Tag1"]/@Value)[1]', 'varchar(50)') LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value('(Item[@Name="Tag2"]/@Value)[1]', 'varchar(50)') LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value('(Item[@Name="Tag3"]/@Value)[1]', 'varchar(50)') LIKE '%'+@Query+'%')
GROUP BY t.id,t.AdditionalInfo
) tt
WHERE tt.rowNum between @startPos and @requireCount + @startPos-1
There might be more. Not sure.
Either way, it would do you a lot of good to learn about the theory behind the relational database model. This query needs a lot more help than what I just added. I mean it needs A LOT more help.
Edit: You also can't have a ROW_NUMBER() in a query that selects COUNT(*). What would you be trying to number? The number of Counts?
A guess, cause I can't run it, but try Changing it to:
Select * From
(Select count(*), t.id, t.AdditionalInfo, ROW_NUMBER()
OVER (order by t.id) AS rowNum
From dbo.foo t
CROSS APPLY t.AdditionalInfo.nodes('/AdditionalInfo')
AS MyTestXMLQuery(AdditionalInfo)
Where
(Tag4=''+@InputTag4+'' OR Tag5=''+@InputTag5+'')
and (MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag1"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag2"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%'
or MyTestXMLQuery.AdditionalInfo.value
('(Item[@Name="Tag3"]/@Value)[1]', 'varchar(50)')
LIKE '%'+@Query+'%')
Group By t.id, t.AdditionalInfo ) tt
Where tt.rowNum between @startPos and @requireCount + @startPos-1

Resources