SQL Server: Lead/Lag analytic function across groups (and not within groups)

Sorry for the long post, but I have provided copy & paste sample data and a possible solution approach below. The relevant part of the question is in the upper part of the post (above the horizontal rule).
I have the following table
Dt          customer_id  buy_time     money_spent
-------------------------------------------------
2000-01-04  100          11:00:00.00  2
2000-01-05  100          16:00:00.00  1
2000-01-10  100          13:00:00.00  4
2000-01-10  100          14:00:00.00  3
2000-01-04  200          09:00:00.00  10
2000-01-06  200          10:00:00.00  11
2000-01-06  200          11:00:00.00  5
2000-01-10  200          08:00:00.00  20
and want a query to get this result set
Dt          Dt_next     customer_id  buy_time     money_spent
-------------------------------------------------------------
2000-01-04  2000-01-05  100          11:00:00.00  2
2000-01-05  2000-01-10  100          16:00:00.00  1
2000-01-10  NULL        100          13:00:00.00  4
2000-01-10  NULL        100          14:00:00.00  3
2000-01-04  2000-01-06  200          09:00:00.00  10
2000-01-06  2000-01-10  200          10:00:00.00  11
2000-01-06  2000-01-10  200          11:00:00.00  5
2000-01-10  NULL        200          08:00:00.00  20
That is: for each customer (customer_id) and each day (Dt), I want the next day on which the same customer has visited (Dt_next).
I already have a query that produces this result set (data and query are included below the horizontal rule). However, it involves a left outer join and two dense_rank window functions. This approach seems a bit clumsy to me, and I think there should be a better solution. Any pointers to alternative solutions are highly appreciated! Thank you!
BTW: I am using SQL Server 11 (SQL Server 2012) and the table has well over 1 million rows.
My query:
select
    customer_table.Dt
    ,customer_table_lead.Dt as Dt_next
    ,customer_table.customer_id
    ,customer_table.buy_time
    ,customer_table.money_spent
from
(
    select
        #customer_data.*
        ,dense_rank() over (partition by customer_id order by customer_id asc, Dt asc) as Dt_int
    from #customer_data
) as customer_table
left outer join
(
    select distinct
        #customer_data.Dt
        ,#customer_data.customer_id
        ,dense_rank() over (partition by customer_id order by customer_id asc, Dt asc)-1 as Dt_int
    from #customer_data
) as customer_table_lead
on
(
    customer_table.Dt_int=customer_table_lead.Dt_int
    and customer_table.customer_id=customer_table_lead.customer_id
)
Sample data:
create table #customer_data (
Dt date not null,
customer_id int not null,
buy_time time(2) not null,
money_spent float not null
);
insert into #customer_data values ('2000-01-04',100,'11:00:00',2);
insert into #customer_data values ('2000-01-05',100,'16:00:00',1);
insert into #customer_data values ('2000-01-10',100,'13:00:00',4);
insert into #customer_data values ('2000-01-10',100,'14:00:00',3);
insert into #customer_data values ('2000-01-04',200,'09:00:00',10);
insert into #customer_data values ('2000-01-06',200,'10:00:00',11);
insert into #customer_data values ('2000-01-06',200,'11:00:00',5);
insert into #customer_data values ('2000-01-10',200,'08:00:00',20);

Try this query:
select cd.Dt
     , t.Dt_next
     , cd.customer_id
     , cd.buy_time
     , cd.money_spent
from (
    -- one row per customer and day, with that customer's next visit day
    select Dt
         , LEAD(Dt) OVER (PARTITION BY customer_id ORDER BY Dt) AS Dt_next
         , customer_id
    from (
        -- collapse multiple purchases on the same day into a single row
        select distinct Dt, customer_id
        from #customer_data
    ) d
) t
inner join #customer_data cd on t.customer_id = cd.customer_id and t.Dt = cd.Dt
Why does the money_spent field have the float type? Float is an approximate type, so you may run into rounding problems in calculations. Consider converting it to decimal.
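To illustrate the float remark, here is a minimal sketch. The variables and the value 0.1 are made up for illustration and are not taken from the table above. It adds 0.1 ten times to a float and to a decimal and compares each sum with 1.0.

```sql
-- Assumption: illustrative values only, not related to #customer_data.
DECLARE @f float = 0;
DECLARE @d decimal(10,2) = 0;
DECLARE @i int = 0;

-- Add 0.1 ten times to both variables.
WHILE @i < 10
BEGIN
    SET @f = @f + 0.1;
    SET @d = @d + 0.1;
    SET @i = @i + 1;
END;

-- The float sum is only approximately 1, so an equality test can fail;
-- the decimal sum is exact.
SELECT CASE WHEN @f = 1.0 THEN 'equal' ELSE 'not equal' END AS float_sum_equals_1
     , CASE WHEN @d = 1.0 THEN 'equal' ELSE 'not equal' END AS decimal_sum_equals_1;
```

On a typical installation the float column should report 'not equal' while the decimal column reports 'equal', which is the kind of surprise that tends to show up in money calculations.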

Related

How can I match a row in SQL Server only once?

I have the following problem where I am kindly asking for your help when joining two tables in SQL Server 2016 (v13).
I have 2 tables, Revenues and Cashins.
Revenues:
RevenueID  ProductID  InvoiceNo  Amount
---------------------------------------
123        456        987        1000
234        456        987        1000
Cashins:
CashinID  ProductID  InvoiceNo  Amount
--------------------------------------
ABC       456        987        1000
CDE       456        987        1000
The goal is to match cashins automatically to revenues (but only once!).
Both tables have their unique-ids but the columns used to join these tables are
ProductID
InvoiceNo
Amount
For entries with only one row in each table with those criteria, everything works fine.
Sometimes though, there are several rows that have the same value within these columns (as above) but with a unique ID (this is no error, but the way it is supposed to be).
The problem with it is, that while joining it results in a cartesian product.
To recreate the tables, here the statements:
DROP TABLE IF EXISTS Revenues
GO
CREATE TABLE Revenues
(
RevenueID [nvarchar](10) NULL,
ProductID [nvarchar](10) NULL,
InvoiceNo [nvarchar](10) NULL,
Amount money NULL
)
GO
DROP TABLE IF EXISTS CashIns
GO
CREATE TABLE CashIns
(
CashinID [nvarchar](10) NULL,
ProductID [nvarchar](10) NULL,
InvoiceNo [nvarchar](10) NULL,
Amount money NULL
)
GO
INSERT INTO [Revenues] VALUES ('123', '456', '987', 1000)
INSERT INTO [Revenues] VALUES ('234', '456', '987', 1000)
INSERT INTO [CashIns] VALUES ('ABC', '456', '987', 1000)
INSERT INTO [CashIns] VALUES ('CDE', '456', '987', 1000)
Desired output:
RevenueID  ProductID  InvoiceNo  Amount  CashinID
-------------------------------------------------
123        456        987        1000    ABC
234        456        987        1000    CDE
Current query:
SELECT
    R.RevenueID,
    R.ProductID,
    R.InvoiceNo,
    R.Amount,
    C.CashinID
FROM
    [Revenues] R
LEFT JOIN
    [CashIns] C ON R.ProductID = C.ProductID
               AND R.InvoiceNo = C.InvoiceNo
               AND R.Amount = C.Amount
Results:
RevenueID  ProductID  InvoiceNo  Amount  CashinID
-------------------------------------------------
123        456        987        1000    ABC
123        456        987        1000    CDE
234        456        987        1000    ABC
234        456        987        1000    CDE
Which in theory makes sense, but I just can't seem to find a solution where each row is just used once.
Two things I found and tried are window functions and the OUTER APPLY operator with a TOP(1) selection. Both gave the same result:
SELECT
    *
FROM
    [Revenues] R
OUTER APPLY
    (SELECT TOP(1) *
     FROM [CashIns] C) C
This returns the desired columns from the Revenues table, but only matches the first row from the Cashins table:
RevenueID  ProductID  InvoiceNo  Amount  CashinID
-------------------------------------------------
123        456        987        1000    ABC
234        456        987        1000    ABC
I also thought about something like updating the Revenues table, so that the matched CashinID is next to a line and then check every time that the CashinID is not yet used within that table, but I couldn't make it work...
Many thanks in advance for any help or hint in the right direction!

As I said in my comment, you have a fundamental problem with your data relationships. One of your tables needs to reference the unique identifier of the other. If you can't do that, the only option is to order the transactions in both tables and join them by row number. You're relying on a hope and a prayer to join your data instead of irrefutable identifiers.
--This example orders the transactions in each transaction table and uses
--the order number to join them.
WITH RevPrelim AS (
    SELECT *
         , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY RevenueID) AS row_num
    FROM [Revenues] R
), CashinsPrelim AS (
    SELECT *
         , ROW_NUMBER() OVER(PARTITION BY InvoiceNo, ProductID, Amount ORDER BY CashinID) AS row_num
    FROM [CashIns] AS C
)
SELECT *
FROM RevPrelim AS r
LEFT OUTER JOIN CashinsPrelim AS c
    ON c.ProductID = r.ProductID
    AND c.InvoiceNo = r.InvoiceNo
    AND c.Amount = r.Amount
    AND c.row_num = r.row_num
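If cash-ins without a matching revenue row should also show up, one possible variation is a FULL OUTER JOIN on the same keys. The sketch below is only an illustration under assumptions: it uses table variables instead of the real tables, and the third cash-in row ('EFG') is hypothetical and not part of the question's data.

```sql
-- Sample rows from the question, plus one hypothetical extra cash-in ('EFG').
DECLARE @Revenues TABLE (RevenueID nvarchar(10), ProductID nvarchar(10), InvoiceNo nvarchar(10), Amount money);
DECLARE @CashIns  TABLE (CashinID  nvarchar(10), ProductID nvarchar(10), InvoiceNo nvarchar(10), Amount money);

INSERT INTO @Revenues VALUES ('123', '456', '987', 1000), ('234', '456', '987', 1000);
INSERT INTO @CashIns  VALUES ('ABC', '456', '987', 1000), ('CDE', '456', '987', 1000), ('EFG', '456', '987', 1000);

WITH RevPrelim AS (
    -- number the revenues inside each (InvoiceNo, ProductID, Amount) group
    SELECT RevenueID, ProductID, InvoiceNo, Amount
         , ROW_NUMBER() OVER (PARTITION BY InvoiceNo, ProductID, Amount ORDER BY RevenueID) AS row_num
    FROM @Revenues
), CashinsPrelim AS (
    -- number the cash-ins inside the same kind of group
    SELECT CashinID, ProductID, InvoiceNo, Amount
         , ROW_NUMBER() OVER (PARTITION BY InvoiceNo, ProductID, Amount ORDER BY CashinID) AS row_num
    FROM @CashIns
)
SELECT r.RevenueID
     , COALESCE(r.ProductID, c.ProductID) AS ProductID
     , COALESCE(r.InvoiceNo, c.InvoiceNo) AS InvoiceNo
     , COALESCE(r.Amount, c.Amount)       AS Amount
     , c.CashinID
FROM RevPrelim AS r
FULL OUTER JOIN CashinsPrelim AS c
    ON  c.ProductID = r.ProductID
    AND c.InvoiceNo = r.InvoiceNo
    AND c.Amount    = r.Amount
    AND c.row_num   = r.row_num
ORDER BY COALESCE(r.row_num, c.row_num);
```

With these rows, 123 pairs with ABC, 234 pairs with CDE, and EFG comes back with a NULL RevenueID, which makes unmatched cash-ins easy to spot. The pairing is still arbitrary within a group, so the caveat about missing identifiers applies here as well.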

Average using SQL Group by needs to omit duplicates and group by more than one column

I'm using SQL Server 2016 and I'm having an issue grouping by more than one col and finding an average while omitting duplicate rows. I have a transaction table defined as:
CREATE TABLE [dbo].[CUST_TRANSACTION](
[EXTRACT_DATE] [date] NULL,
[CUSTOMER_ID] [bigint] NULL,
[TRANS_NUMBER] [bigint] NULL,
[CATEGORY] [smallint] NULL,
[RANKING] [smallint] NULL )
Here is some data:
EXTRACT_DATE CUSTOMER_ID TRANS_NUMBER CATEGORY RANKING
10/31/2017 10001 1000101 4 100
10/31/2017 10001 1000102 4 100
10/31/2017 10002 1000201 4 200
10/31/2017 10001 1000103 5 100
10/31/2017 10003 1000301 5 300
10/31/2017 10003 1000302 5 300
10/31/2017 10004 1000401 7 500
10/31/2017 10001 1000104 8 100
The Customer_Id AND TRANS_NUMBER combo needs to be unique, but a customer_id can have 1 to Many Trans_Numbers and a Customer_Id can exist in 1 to many Categories. From the data I reviewed, the Ranking for a Customer_ID seems to be the same for a given EXTRACT_DATE. I found no NULLS in the Ranking, but I did find zeroes, so I need to exclude any zeroes from the Average.
The request is to generate a report broken down by each Category ( 1 - 15) and find the Average Ranking within that Category, but to only count a customer_id once and also find the Max Ranking with that Category. This is for a given EXTRACT_Date.
So I ran the following:
Select CATEGORY, MAX(RANKING) "Max Ranking", AVG(RANKING) "Average Ranking"
from CUST_TRANSACTION
where EXTRACT_DATE = Convert(datetime, '2017-10-31' )
and RANKING > 1
group by CATEGORY
order by CATEGORY
Generated the following output:
CATEGORY Max Ranking Average Ranking
4 200 133
5 300 233
7 500 500
8 100 100
But Category 4 should have an Average of 150 since customer_Id = 10001 has two entries and Category 5 should be = 200 since Customer_id 10003 has two entries.
When I tried to Group by both Category, Customer_Id, the output includes each combination of Category and Customer_Id, which is what Group by does. So I'm not sure if I need a sub-select or any other ideas?
Thanks

It looks like you don't care about the trans_number mappings, so you could drop that column and select the distinct remaining values in a derived table:
Select CATEGORY, MAX(RANKING) "Max Ranking", AVG(RANKING) "Average Ranking"
from ( select distinct [EXTRACT_DATE] ,
[CUSTOMER_ID] ,
[CATEGORY] ,
[RANKING] from CUST_TRANSACTION )CUST_TRANSACTION
where EXTRACT_DATE = Convert(datetime, '2017-10-31' )
and RANKING > 1
group by CATEGORY
order by CATEGORY

You can use a common table expression (CTE) to filter out duplicate customer IDs within a category. Something like this:
;with cte as (
    select CATEGORY, RANKING, EXTRACT_DATE,
           -- number each customer's rows per extract date and category; rn = 1 keeps one row per customer
           ROW_NUMBER() over(partition by EXTRACT_DATE, category, customer_id order by customer_id) rn
    from CUST_TRANSACTION
)
Select CATEGORY, MAX(RANKING) "Max Ranking", AVG(RANKING) "Average Ranking"
from cte --CUST_TRANSACTION
where EXTRACT_DATE = Convert(datetime, '2017-10-31' )
and RANKING > 1
and rn = 1
group by CATEGORY
order by CATEGORY

Because the overall average and the maximum have different requirements, you can't derive both from a single column. A sub-select can deliver one column for averaging and another for taking the maximum.
DECLARE @QUERY_DATE DATE = '2017-10-31';
Select
    CATEGORY
    , MAX(RANKING_detail_max) "Max Ranking"
    , AVG(RANKING_detail_avg) "Average Ranking"
from (
    -- one row per category and customer, so each customer counts once
    select CATEGORY
        , CUSTOMER_ID
        , AVG(RANKING) RANKING_detail_avg
        , MAX(RANKING) RANKING_detail_max
    from CUST_TRANSACTION
    where EXTRACT_DATE = @QUERY_DATE
    and RANKING > 0
    group by CATEGORY, CUSTOMER_ID
) per_customer
group by CATEGORY
order by CATEGORY
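To check the "count each customer once" idea against the sample rows from the question, here is a self-contained sketch. It assumes, as the question observes, that a customer's RANKING is the same across all rows for a given EXTRACT_DATE. The table variable @CUST_TRANSACTION and the alias per_customer are illustrative names, and the cast to decimal is my own addition to avoid integer truncation in AVG.

```sql
-- Sample rows copied from the question into a table variable.
DECLARE @CUST_TRANSACTION TABLE (
    EXTRACT_DATE date, CUSTOMER_ID bigint, TRANS_NUMBER bigint,
    CATEGORY smallint, RANKING smallint);

INSERT INTO @CUST_TRANSACTION VALUES
    ('2017-10-31', 10001, 1000101, 4, 100),
    ('2017-10-31', 10001, 1000102, 4, 100),
    ('2017-10-31', 10002, 1000201, 4, 200),
    ('2017-10-31', 10001, 1000103, 5, 100),
    ('2017-10-31', 10003, 1000301, 5, 300),
    ('2017-10-31', 10003, 1000302, 5, 300),
    ('2017-10-31', 10004, 1000401, 7, 500),
    ('2017-10-31', 10001, 1000104, 8, 100);

SELECT CATEGORY
     , MAX(RANKING) AS [Max Ranking]
     -- AVG over a smallint returns an integer, so cast first to keep decimals
     , AVG(CAST(RANKING AS decimal(10,2))) AS [Average Ranking]
FROM (
    -- one row per category and customer (relies on a constant ranking per customer)
    SELECT DISTINCT CATEGORY, CUSTOMER_ID, RANKING
    FROM @CUST_TRANSACTION
    WHERE EXTRACT_DATE = '2017-10-31'
      AND RANKING > 0
) AS per_customer
GROUP BY CATEGORY
ORDER BY CATEGORY;
```

On this data, Category 4 averages 150 and Category 5 averages 200, which matches the values the question asks for.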

parent id hierarchy identification MS SqlServer2012

I have this code
create table #temp
(
order_id int not null identity(1,1) primary key
,sid int
,created_date date
,parent_order_id int
)
insert into #temp
(
sid
,created_date
)values(1,'2017-01-01')
insert into #temp
(
sid
,created_date
,parent_order_id
)values(1,'2017-02-01',1),(1,'2017-03-01',2),(1,'2017-04-01',3)
insert into #temp
(
sid
,created_date
)values(1,'2017-06-01')
insert into #temp
(
sid
,created_date
,parent_order_id
)values(1,'2017-07-01',5),(1,'2017-08-01',6)
select * from #temp
A NULL parent_order_id indicates a new order. After that, the customer can add items associated with that order, and parent_order_id is filled in for these associated rows. I want to know the first order_id in the chain for each child order. I am looking for output like the one below.
order_id  sid  created_date  parent_order_id  original_order_id
1         1    2017-01-01    NULL             1
2         1    2017-02-01    1                1
3         1    2017-03-01    2                1
4         1    2017-04-01    3                1
5         1    2017-06-01    NULL             4
6         1    2017-07-01    5                4
7         1    2017-08-01    6                4
Any help is appreciated. Thanks in advance.

With the following piece of code you can get results you are expecting.
;WITH cte (order_id, original_order_id)
AS
(
SELECT order_id, order_id AS original_order_id
FROM #temp WHERE parent_order_id IS NULL
UNION ALL
SELECT o.order_id AS order_id, cte.original_order_id AS original_order_id
FROM #temp AS o
JOIN cte
ON o.parent_order_id = cte.order_id
)
SELECT #temp.order_id, #temp.sid, #temp.created_date, #temp.parent_order_id, cte.original_order_id
FROM #temp
JOIN cte ON cte.order_id=#temp.order_id
ORDER BY cte.order_id
Please be aware that recursive CTEs have a recursion limit. The default is 100 levels; it can be raised up to 32767 with the MAXRECURSION query hint.
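As a small sketch of how the limit is raised, the hint goes in an OPTION clause at the end of the statement that uses the CTE. The CTE name numbers and the bound of 150 are made up for illustration.

```sql
-- Generates the numbers 1..150, which needs 149 recursion levels
-- and would therefore fail with the default limit of 100.
WITH numbers (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1
    FROM numbers
    WHERE n < 150
)
SELECT COUNT(*) AS row_count
FROM numbers
OPTION (MAXRECURSION 200);  -- allow up to 200 levels for this statement only
```

For the order hierarchy above this only matters if a single chain of child orders can get longer than 100 links.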

Modifying the current row based on the previous row in sql server

I have a result set like this:
YearMonth Sales
201411 100
201412 100
201501 100
201502 100
201503 100
201504 100
201505 100
201506 100
201507 100
201508 100
I need to add another column in which each month's sales are 4% higher than the previous month's. For example, my result should be
YearMonth Sales New Sales
201411 100 100.00
201412 100 104.00
201501 100 108.16
201502 100 112.49
201503 100 116.99
201504 100 121.67
201505 100 126.53
201506 100 131.59
201507 100 136.86
201508 100 142.33
Please help me to get the best way for it.

Here is an answer that fits your requirement; it took a while to figure out. Just replace the #Temp table name with your own table name and check the column names as well.
DECLARE @nCurrentSale FLOAT
DECLARE @nYearMonth INT
DECLARE @nSale FLOAT
CREATE TABLE #TempNEW(YearMonth VARCHAR(10), Sales FLOAT, NewSale FLOAT)
-- start from the sales value of the earliest month
SELECT TOP 1 @nCurrentSale = Sales FROM #Temp
ORDER BY (CAST('01/' + SUBSTRING(CAST(YearMonth AS VARCHAR), 5, 2) + '/' + SUBSTRING(CAST(YearMonth AS VARCHAR), 0, 5) AS DATETIME)) ASC
DECLARE Cursor1 CURSOR FOR
SELECT YearMonth, Sales FROM #Temp
ORDER BY (CAST('01/' + SUBSTRING(CAST(YearMonth AS VARCHAR), 5, 2) + '/' + SUBSTRING(CAST(YearMonth AS VARCHAR), 0, 5) AS DATETIME)) ASC
OPEN Cursor1
FETCH NEXT FROM Cursor1 INTO @nYearMonth, @nSale
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO #TempNEW(YearMonth, Sales, NewSale) VALUES(@nYearMonth, @nSale, CAST(@nCurrentSale AS DECIMAL(12,2)))
    -- compound by 4% for the next month
    SET @nCurrentSale = @nCurrentSale + ((@nCurrentSale/100) * 4)
    FETCH NEXT FROM Cursor1 INTO @nYearMonth, @nSale
END
CLOSE Cursor1
DEALLOCATE Cursor1
SELECT * FROM #TempNEW
Notify me with your status.

Yes, it is possible. First alter the table to add the extra column NewSales, then try the approach from this link:
https://dba.stackexchange.com/questions/34243/update-row-based-on-match-to-previous-row
I think you can get it done that way.
SQL Server also supports persisted computed columns. With those you specify a formula, and the new column value is calculated automatically from it. Keep in mind that such a formula works on the columns of the current row.

Here are two thoughts. I'm not sure I fully understood the use case, and note that both solutions only work on SQL Server 2012 and up.
So given the table
CREATE TABLE [dbo].[LagExample](
[YearMonth] [nvarchar](100) NOT NULL,
[Sales] [money] NOT NULL
)
The first one is fairly simple and just assumes you want to base the magnitude of the percentage increase on how many rows (months) came before the current one.
;WITH cte
as
(
SELECT YearMonth,
ROW_NUMBER() OVER (ORDER BY YearMonth) - 1 AS SalesEntry,
cast(LAG(Sales, 1,Sales) OVER (ORDER BY YearMonth) as float) as Sales
FROM LagExample
)
SELECT YearMonth,
Sales,
cast(Sales * POWER(cast(1.04 as float), SalesEntry) AS decimal(10,2)) as NewSales
FROM cte
The Second one uses a recursive CTE to calculate the value as you move along the months..
Here's a good link about recursive CTEs
http://www.codeproject.com/Articles/683011/How-to-use-recursive-CTE-calls-in-T-SQL
;with data
as
(
SELECT Lead(le.YearMonth, 1, null) OVER (ORDER BY le.YearMonth) as NextYearMonth,
cast(le.Sales as Decimal(10,4)) as Sales,
le.YearMonth
FROM LagExample le
)
,cte
as
(
SELECT *
FROM data
Where YearMonth = '201411'
UNION ALL
SELECT
data.NextYearMonth,
cast(cte.Sales * 1.04 as Decimal(10,4)) as Sales,
data.YearMonth
From cte join
data on data.YearMonth = cte.NextYearMonth
)
SELECT YearMonth, cast(Sales as Decimal(10,2))
FROM cte
order by YearMonth

SQL - Query log from Users table

Based on the following example : (it is a "QueryLog" table, this table store interactions between a user and two different products N and R):
Id Date UserID Product
--------------------------------------------------
0 2013-06-09 14:50:24.000 100 N
1 2013-06-09 15:27:23.000 100 N
2 2013-06-09 15:29:23.000 100 N
3 2013-06-17 15:31:23.000 100 N
4 2013-06-17 15:32:23.000 100 N
5 2014-05-19 15:30:23.000 250 N
6 2014-07-19 15:27:23.000 250 N
7 2014-07-19 15:27:23.000 333 R
8 2014-08-19 15:27:23.000 333 R
Expected results :
Count
-----
1
(Only UserID 250 is inside my criteria)
If a user interacts 10 times with the product within a single month, that user does not meet my criteria.
To sum up, I am looking for:
The number of distinct users that had interactions with product N in more than one month (whatever the number of interactions the user may have had during a single month).
This is the code I've tried:
select distinct v.UserID, v.mois , v.annee
from
(select c.UserID , c. mois, c.annee, COUNT(c.UserID) as frequence
from
(
SELECT
datepart(month,[DATE]) as mois,
datepart(YEAR,[DATE]) as annee ,
Username,
UserID,
Product
FROM QueryLog
where Product = 'N'
) c
group by c.UserID, c.annee, c.mois
) v
group by v.UserID, v.mois, v.annee

try this:
DECLARE @YourTable table (Id int, [Date] datetime, UserID int, Product char(1))
INSERT INTO @YourTable VALUES (0,'2013-06-09 14:50:24',100 ,'N')
,(1,'2013-06-09 15:27:23',100 ,'N')
,(2,'2013-06-09 15:29:23',100 ,'N')
,(3,'2013-06-17 15:31:23',100 ,'N')
,(4,'2013-06-17 15:32:23',100 ,'N')
,(5,'2014-05-19 15:30:23',250 ,'N')
,(6,'2014-07-19 15:27:23',250 ,'N')
,(7,'2014-07-19 15:27:23',333 ,'R')
,(8,'2014-08-19 15:27:23',333 ,'R')
;WITH MultiMonthUsers AS
(
    select
        UserID
    FROM (select
              UserID
          FROM @YourTable
          WHERE product='N'
          GROUP BY UserID, YEAR([Date]),MONTH([Date])
         )dt2
    GROUP BY UserID
    HAVING COUNT(*)>1
)
SELECT COUNT(*) FROM MultiMonthUsers
Depending on the number of rows and the indexes, this may run slowly. Using YEAR([Date]), MONTH([Date]) will prevent index usage on [Date].

I think this will do it, but I need a better dataset to test with:
SELECT COUNT(*)
FROM (
    --roll all month/user records into single row
    SELECT UserID, datediff(month, 0, [date]) As MonthGroup
    FROM QueryLog
    WHERE Product='N'
    GROUP BY datediff(month, 0, [date]), UserId
) t
-- look for users with multiple rows
GROUP BY UserID
HAVING COUNT(UserID) > 1
Seems like there should be a way to roll this up further, to avoid the need for the nested select.
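One possible way to cut this down to a single level of grouping is to count distinct month numbers per user in the HAVING clause. The sketch below is an untuned illustration under assumptions: it loads the question's sample rows into a table variable named @QueryLog, and the aliases u and multi_month_users are made up.

```sql
-- Sample rows from the question (ISO 8601 literals to avoid language-dependent parsing).
DECLARE @QueryLog TABLE (Id int, [Date] datetime, UserID int, Product char(1));

INSERT INTO @QueryLog VALUES
    (0, '2013-06-09T14:50:24', 100, 'N'),
    (1, '2013-06-09T15:27:23', 100, 'N'),
    (2, '2013-06-09T15:29:23', 100, 'N'),
    (3, '2013-06-17T15:31:23', 100, 'N'),
    (4, '2013-06-17T15:32:23', 100, 'N'),
    (5, '2014-05-19T15:30:23', 250, 'N'),
    (6, '2014-07-19T15:27:23', 250, 'N'),
    (7, '2014-07-19T15:27:23', 333, 'R'),
    (8, '2014-08-19T15:27:23', 333, 'R');

SELECT COUNT(*) AS multi_month_users
FROM (
    SELECT UserID
    FROM @QueryLog
    WHERE Product = 'N'
    GROUP BY UserID
    -- DATEDIFF(month, 0, [Date]) is the number of months since 1900-01-01,
    -- so each calendar month maps to one distinct integer
    HAVING COUNT(DISTINCT DATEDIFF(month, 0, [Date])) > 1
) AS u;
```

On the sample data this returns 1, because only UserID 250 has product N interactions in two different months. A derived table is still needed to count the qualifying users, but the inner query groups only once.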