Weekly Streaks - Snowflake - snowflake-cloud-data-platform

I have various questions about consecutive customer visits. They range from 90-day to monthly and quarterly windows, and currently the question is weekly, so I am really looking for code that is easy to duplicate and edit to cover my bases.
I found some helpful articles to get me started, but I am having trouble deciphering the output, which makes it hard to validate.
The code used is:
with output as
(select distinct
unqclient
,dategroup
,datediff('week',min(weekofclass),max(weekofclass)) as streak
,min(weekofclass) as startdate
,max(weekofclass) as enddate
from
(select
*
,dateadd('week',-rank,weekofclass) as dategroup
from
(select distinct
concat(v.location, v.clientid) as unqclient
,cast(v.classdate as date) as classdate2
,date_trunc('week',classdate2) as weekofclass
,dense_rank() over (partition by unqclient order by weekofclass) as rank
from visit_data as v
where (missed=false and cancelled=false)
)
)
group by 1,2)
select
output.*
,c.location
,c.emailname
,s.studioname
from output
left join clients as c
on concat(c.location,c.clientid)=output.unqclient
left join studios as s
on c.location=s.location
where streak>1
and date_trunc('week',enddate)>=date_trunc('week',current_date()-7)
order by streak desc
The start and end dates are right (they show the Monday of the week in which the first and last visit of the streak occurred). But I don't understand what 'dategroup' is outputting: the values do not correspond to that client's first visit, nor do they fall within the streak.
Example of output from the code above:
Client  Date Group  Streak  Start Date  End Date
------  ----------  ------  ----------  ----------
A       2020-03-02  116     2020-06-15  2022-09-05
B       2017-04-24  122     2020-05-04  2022-09-05

So if we start with some fake data with weekly streaks:
with visit_data(location, clientid, classdate, missed, cancelled ) as (
select * from values
-- (1, 10, '2022-06-15 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-06-22 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-06-29 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-06 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-20 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-27 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-03 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-10 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-17 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-24 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-31 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-7 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-14 10:12:13'::timestamp, false, false)
), clients(location, clientid, emailname) as (
select * from values
(1, 10, 'email_10#example.com')
), studios(location, studioname) as (
select * from values
(1, 'location_1')
)
We can move the two inner sub-selects into one block and inspect it:
select distinct
v.location
,v.clientid
,date_trunc('week', v.classdate) as weekofclass
,dense_rank() over (partition by v.location, v.clientid order by weekofclass) as rank
,dateadd('week', -rank, weekofclass) as dategroup
from visit_data as v
where missed = false and cancelled = false
gives:
LOCATION  CLIENTID  WEEKOFCLASS              RANK  DATEGROUP
--------  --------  -----------------------  ----  -----------------------
1         10        2022-08-15 00:00:00.000  1     2022-08-08 00:00:00.000
1         10        2022-08-22 00:00:00.000  2     2022-08-08 00:00:00.000
1         10        2022-08-29 00:00:00.000  3     2022-08-08 00:00:00.000
1         10        2022-09-05 00:00:00.000  4     2022-08-08 00:00:00.000
1         10        2022-09-12 00:00:00.000  5     2022-08-08 00:00:00.000
We can drop the timestamp -> date cast (if that is what is happening), as date_trunc will reduce the value further anyway. If classdate is stored as a string inside visit_data (it shouldn't be), the cast can be done inline with:
,date_trunc('week', v.classdate::date) as weekofclass
Next, the date group can be calculated in the same select.
Dategroup is just a clustering key; its actual value is not really important. It works because dense_rank gives increasing numbers over the distinct truncated weeks, and subtracting that rank (in weeks) from the week gives the same result for every week that is part of the same unbroken run.
If we put some "gaps in the data"
with visit_data(location, clientid, classdate, missed, cancelled ) as (
select * from values
(1, 10, '2022-06-15 10:12:13'::timestamp, false, false),
(1, 10, '2022-06-22 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-06-29 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-06 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-20 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-27 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-03 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-10 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-17 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-24 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-31 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-7 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-14 10:12:13'::timestamp, false, false)
)
we get:
LOCATION  CLIENTID  WEEKOFCLASS              RANK  DATEGROUP
--------  --------  -----------------------  ----  -----------------------
1         10        2022-06-13 00:00:00.000  1     2022-06-06 00:00:00.000
1         10        2022-06-20 00:00:00.000  2     2022-06-06 00:00:00.000
1         10        2022-07-04 00:00:00.000  3     2022-06-13 00:00:00.000
1         10        2022-07-11 00:00:00.000  4     2022-06-13 00:00:00.000
1         10        2022-07-18 00:00:00.000  5     2022-06-13 00:00:00.000
1         10        2022-08-15 00:00:00.000  6     2022-07-04 00:00:00.000
1         10        2022-08-22 00:00:00.000  7     2022-07-04 00:00:00.000
1         10        2022-08-29 00:00:00.000  8     2022-07-04 00:00:00.000
1         10        2022-09-05 00:00:00.000  9     2022-07-04 00:00:00.000
1         10        2022-09-12 00:00:00.000  10    2022-07-04 00:00:00.000
We see the rank steps in ones, but some weeks are separated by a gap, so the subtraction gives a different value after each gap. This pattern is known as Gaps and Islands.
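To see why the subtraction clusters consecutive weeks, here is a rough Python sketch of the same idea, outside Snowflake. The helper names (`week_start`, `weekly_islands`) are made up for illustration, the visit dates are the ones from the gap example, and it assumes weeks start on Monday, as the date_trunc output here does.

```python
from datetime import date, timedelta
from itertools import groupby

def week_start(d):
    """Monday of the week containing d (assumed stand-in for date_trunc('week', d))."""
    return d - timedelta(days=d.weekday())

def weekly_islands(visit_dates):
    """Return (dategroup, first_week, last_week, weeks_in_run) per unbroken weekly run."""
    weeks = sorted({week_start(d) for d in visit_dates})  # distinct weeks, like SELECT DISTINCT
    # rank runs 1, 2, 3, ... over the distinct weeks, like dense_rank()
    keyed = [(w - timedelta(weeks=rank), w) for rank, w in enumerate(weeks, start=1)]
    islands = []
    # weeks in the same unbroken run end up with the same (week - rank) key
    for dategroup, rows in groupby(keyed, key=lambda r: r[0]):
        run = [w for _, w in rows]
        islands.append((dategroup, run[0], run[-1], len(run)))
    return islands

visits = [date(2022, 6, 15), date(2022, 6, 22),
          date(2022, 7, 6), date(2022, 7, 13), date(2022, 7, 13), date(2022, 7, 20),
          date(2022, 8, 17), date(2022, 8, 24), date(2022, 8, 31),
          date(2022, 9, 7), date(2022, 9, 14)]

for island in weekly_islands(visits):
    print(island)
```

Each run shares one dategroup value (2022-06-06, 2022-06-13 and 2022-07-04 for this data), which is what makes it usable as a grouping key. Note that the sketch counts the weeks in a run, whereas datediff('week', startdate, enddate) in the query returns one less than that.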
After that, the GROUP BY and DISTINCT in the output CTE are redundant; you should only need one or the other, not both.
In the outer select, the date_trunc('week',enddate) is not needed as the date is already truncated to the week. Functions on columns in WHERE clauses are best avoided, and this filter can be pushed into the output CTE as a HAVING.
Thus, with the data CTEs included:
with visit_data(location, clientid, classdate, missed, cancelled ) as (
select * from values
(1, 10, '2022-06-15 10:12:13'::timestamp, false, false),
(1, 10, '2022-06-22 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-06-29 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-06 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-13 10:12:13'::timestamp, false, false),
(1, 10, '2022-07-20 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-07-27 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-03 10:12:13'::timestamp, false, false),
-- (1, 10, '2022-08-10 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-17 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-24 10:12:13'::timestamp, false, false),
(1, 10, '2022-08-31 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-7 10:12:13'::timestamp, false, false),
(1, 10, '2022-09-14 10:12:13'::timestamp, false, false)
), clients(location, clientid, emailname) as (
select * from values
(1, 10, 'email_10#example.com')
), studios(location, studioname) as (
select * from values
(1, 'location_1')
), inner_a as(
select distinct
v.location
,v.clientid
,date_trunc('week', v.classdate) as weekofclass
,dense_rank() over (partition by v.location, v.clientid order by weekofclass) as rank
,dateadd('week', -rank, weekofclass) as dategroup
from visit_data as v
where missed = false and cancelled = false
order by rank
)
--with
,output as (
select --distinct
location
,clientid
,dategroup
,min(weekofclass) as startdate
,max(weekofclass) as enddate
,datediff('week', startdate, enddate) as streak
from inner_a
group by 1,2,3
having enddate >= date_trunc('week',current_date()-7)
)
select
o.*
,c.location
,c.emailname
,s.studioname
from output as o
left join clients as c
on c.location = o.location
and c.clientid = o.clientid
left join studios as s
on c.location = s.location
where streak>1
order by streak desc
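gives:
LOCATION  CLIENTID  DATEGROUP                STARTDATE                ENDDATE                  STREAK  LOCATION  EMAILNAME             STUDIONAME
--------  --------  -----------------------  -----------------------  -----------------------  ------  --------  --------------------  ----------
1         10        2022-07-04 00:00:00.000  2022-08-15 00:00:00.000  2022-09-12 00:00:00.000  4       1         email_10#example.com  location_1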

Related

for sql server, I want to select all of the records that has changed

For example, my table has a record for each date, and each date's record may be the same as the previous date's record or different. In my case the records for date 1 to date 3 are all the same, then on date 4 the record changes, and on date 5 it changes again, but back to the same values as date 3. I want a way to query the table and get the records of date 1, date 4 and date 5. Any idea how to do it? Thanks
As I read the issue above, a) you take daily logs of all rows, and b) you want to report on any row that is different from the previous day's.
SQL Server has a great operator for dealing with differences across a large number of columns: EXCEPT. It also has the advantage of comparing NULLs as equal to each other, so two rows that both hold NULL in a column still match, while a change from something to NULL, or vice versa, counts as a change. This is not true for most equality/inequality checks.
Here is a version where I create a daily snapshot of some fields from a 'users' table.
The SELECT query finds all rows from the log, except where the previous entry in the log is the same.
CREATE TABLE #UserLog (LogDate date, UserID int, UserName nvarchar(100), UserEmail nvarchar(100), LastLogonDate datetime, PRIMARY KEY (LogDate, UserID));
INSERT INTO #UserLog (LogDate, UserID, UserName, UserEmail, LastLogonDate) VALUES
('20201011', 1, 'Bob', NULL, '20201009 15:38'),
('20201012', 1, 'Bob', NULL, '20201009 15:38'),
('20201013', 1, 'Bob', 'Bob#gm.com', '20201012 09:15'),
('20201014', 1, 'Bob', 'Bob#gm.com', '20201013 19:02'),
('20201015', 1, 'Bob', 'Bob#gm.com', '20201013 19:02'),
('20201017', 1, 'Bob', 'Bob#gm.com', '20201013 19:02'),
('20201013', 2, 'Pat', 'Pat#hm.com', NULL),
('20201014', 2, 'Pat', 'Pat#hm.com', NULL),
('20201015', 2, 'Pat', 'Pat#hm.com', '20201014 20:55'),
('20201017', 2, 'Pat', 'Pat#hm.com', '20201016 13:22');
SELECT LogDate, UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
EXCEPT
SELECT LEAD(LogDate) OVER (PARTITION BY UserID ORDER BY LogDate), UserID, UserName, UserEmail, LastLogonDate
FROM #UserLog
ORDER BY UserID, LogDate;
In the 'EXCEPT' segment, it basically takes each row and changes its date to the next date in sequence for that user, e.g., it turns
('20201012', 1, 'Bob', NULL, '20201009 15:38'),
into
('20201013', 1, 'Bob', NULL, '20201009 15:38'),
As this is not the same as the actual row for Bob on the 13th, the row in the top part of the statement shows.
My initial test run of this simply had a DATEADD(day, 1, Logdate) in the EXCEPT portion, and that would show all rows that were different from yesterday's. However, the updated version above allows for breaks in the sequence (e.g., in the above, the logging failed on the 16th).
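As a cross-check on what the EXCEPT/LEAD query should return, here is a small Python sketch of the same comparison: each row is kept only if its values differ from that user's previous logged row, whatever the gap in dates. The function name `changed_rows` is hypothetical, and the rows are the #UserLog sample data from above.

```python
def changed_rows(log):
    """Keep rows whose values differ from the same user's previous logged row."""
    previous = {}  # user_id -> values of that user's most recent row
    kept = []
    for row in sorted(log, key=lambda r: (r[1], r[0])):  # by user, then date
        log_date, user_id, *values = row
        # None == None is True in Python, so NULL-to-NULL is not reported as a change
        if user_id not in previous or previous[user_id] != values:
            kept.append(row)
        previous[user_id] = values
    return kept

user_log = [
    ("20201011", 1, "Bob", None, "20201009 15:38"),
    ("20201012", 1, "Bob", None, "20201009 15:38"),
    ("20201013", 1, "Bob", "Bob#gm.com", "20201012 09:15"),
    ("20201014", 1, "Bob", "Bob#gm.com", "20201013 19:02"),
    ("20201015", 1, "Bob", "Bob#gm.com", "20201013 19:02"),
    ("20201017", 1, "Bob", "Bob#gm.com", "20201013 19:02"),
    ("20201013", 2, "Pat", "Pat#hm.com", None),
    ("20201014", 2, "Pat", "Pat#hm.com", None),
    ("20201015", 2, "Pat", "Pat#hm.com", "20201014 20:55"),
    ("20201017", 2, "Pat", "Pat#hm.com", "20201016 13:22"),
]

for row in changed_rows(user_log):
    print(row)
```

For this data the kept rows are Bob's 11th, 13th and 14th and Pat's 13th, 15th and 17th. Bob's row on the 17th is dropped even though the 16th is missing from the log, which is the behaviour the LEAD version is meant to give.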
Here's a DB<>fiddle with the code above.
UPDATE - data posted in comment in another answer.
Here's a version with that data.
CREATE TABLE #tLog (LogDate date, v_1 int, v_2 varchar(100), v_3 int, v_4 varchar(10), v_5 int, v_6 varchar(10));
INSERT INTO #tLog (Logdate, v_1, v_2, v_3, v_4, v_5, v_6) VALUES
('20200101', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200102', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200103', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200104', 101, 'test_1', 1, '123', 120, 'JJ'),
('20200105', 100, 'test_1', 0, '123', 120, 'JJ'),
('20200106', 101, 'test_1', 1, '12345', 120, 'JJ'),
('20200107', 101, 'test_1', 1, '12345', 120, 'JJ'),
('20200108', 101, 'test_2', 2, '12345', 200, 'JJ'),
('20200109', 101, 'test_1', 1, '12345', 120, 'TT'),
('20200110', 100, 'test_1', 0, '123', 120, 'JJ');
SELECT LogDate, v_1, v_2, v_3, v_4, v_5, v_6
FROM #tLog
EXCEPT
SELECT LEAD(LogDate) OVER (ORDER BY LogDate), v_1, v_2, v_3, v_4, v_5, v_6
FROM #tLog
ORDER BY LogDate;
And here's a copy of the results of the above. Note that only on the 2nd, 3rd and 7th did the data not change from the previous day.
LogDate     v_1  v_2     v_3  v_4    v_5  v_6
----------  ---  ------  ---  -----  ---  ---
2020-01-01  100  test_1  0    123    120  JJ
2020-01-04  101  test_1  1    123    120  JJ
2020-01-05  100  test_1  0    123    120  JJ
2020-01-06  101  test_1  1    12345  120  JJ
2020-01-08  101  test_2  2    12345  200  JJ
2020-01-09  101  test_1  1    12345  120  TT
2020-01-10  100  test_1  0    123    120  JJ
Note that I have removed the 'PARTITION BY' in the LEAD as there are no real partitions - it's just one row after the next. However there's a distinct chance you may need this when it comes to actual data.
Here's a DB<>fiddle with both the original and this cut-down one with the OP's data.

Remove Partial Duplicate Rows in SQL Server 2016

I have a data set in which some column values match between rows, but the rest do not. I need to delete duplicates where a lower-level SubCategory (Level 2, Level 3 or Level 4) "IS NOT NULL" but its corresponding "duplicate partner" (grouped by [SubCategory Level 1 ID], [Product Category] and [Product Name]) has that same lower-level SubCategory "IS NULL". In the table below I need to remove RowID 2, 4, 6 and 9 (highlighted in red font).
I've tried the Dense_Rank, Rank and Row_Number functions with Partition By, but that did not give me the desired output. Maybe I need to use a combination of them...
E.g.: RowID 1 and 2 are duplicates by [Product Category], [Product Name], [Category Level 1]. "Category Level 1" is just an ID of "Product Category". I need to remove RowID 2 because its corresponding duplicate partner RowID 1 has no "Category Level 3" assigned while RowID 2 has. The same logic applies to RowID 9 and 10, but this time RowID 9 has "Category Level 2" where RowID 10 does not. If both duplicates (RowID 1 and 2) had "Category Level 3" assigned, we would not need to delete either of them.
IF OBJECT_ID('tempdb..#Category', 'U') IS NOT NULL
DROP TABLE #Category;
GO
CREATE TABLE #Category
(
RowID INT NOT NULL,
CategoryID INT NOT NULL,
ProductCategory VARCHAR(100) NOT NULL,
ProductName VARCHAR(100) NOT NULL,
[SubCategory Level 1 ID] INT NOT NULL,
[SubCategory Level 2 ID] INT NULL,
[SubCategory Level 3 ID] INT NULL,
[SubCategory Level 4 ID] INT NULL
);
INSERT INTO #Category (RowID, CategoryID, ProductCategory, ProductName, [SubCategory Level 1 ID], [SubCategory Level 2 ID], [SubCategory Level 3 ID], [SubCategory Level 4 ID])
VALUES
(1, 111, 'Furniture', 'Table', 200, 111, NULL, NULL),
(2, 234, 'Furniture', 'Table', 200, 234, 123, NULL),
(3, 122, 'Furniture', 'Chair', 200, 122, NULL, NULL),
(4, 122, 'Furniture', 'Chair', 200, 122, 32, NULL),
(5, 12, 'Auto', 'Trucks', 300, 766, 12, NULL),
(6, 3434, 'Auto', 'Trucks', 300, 322, 3434, 333),
(7, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(8, 332, 'Auto', 'Sport Vehicles', 300, 332, NULL, NULL),
(9, 300, 'Auto', 'Sedans', 300, 231, NULL, NULL),
(10, 300, 'Auto', 'Sedans', 300, NULL, NULL, NULL),
(11, 300, 'Auto', 'Cabriolet', 300, 456, 688, NULL),
(12, 300, 'Auto', 'Cabriolet', 300, 456, 976, NULL),
(13, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334),
(14, 300, 'Auto', 'Motorcycles', 300, 456, 235, 334);
SELECT * FROM #Category;
-- ADD YOUR CODE HERE TO RETURN the following RowIDs: 2, 4, 6, 9
If I understand this right, your logic is the following:
For each unique SubCategory Level 1, Product Category, and Product Name combination, you want to return the row which has the least amount of filled in SubCategory level data.
Using a quick dense_rank with partitions on the relevant fields, you can order the rows so that those with fewer SubCategory levels filled in are ranked 1. Rows 2, 4, 6, and 9 should now be the only rows returned.
;with DataToSelect
as
(
SELECT *,
DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
ORDER BY
CASE
WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
END) as [ToInclude]
FROM #Category
)
SELECT *
FROM
DataToSelect
WHERE
ToInclude != 1
ORDER BY
RowID
Keep in mind if you have two rows with the same SubCategory level per SubCategory Level 1, Product Category, and Product Name combination, they'll both be included. If you do not want this, just swap the dense_rank to row_number and add some alternative criteria on which should be selected first.
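The ranking idea can be checked against the sample data with a short Python sketch. This is only an illustration of the logic, not T-SQL: `fill_depth` and `rows_to_remove` are made-up names, and a depth of 0 stands in for the NULL that the CASE expression yields when no lower level is set (NULL sorts first in an ascending ORDER BY in SQL Server, so the ordering is the same).

```python
from collections import defaultdict

def fill_depth(level2, level3, level4):
    """Mirror of the CASE expression: the deepest populated level decides the score."""
    if level4 is not None:
        return 3
    if level3 is not None:
        return 2
    if level2 is not None:
        return 1
    return 0  # assumption: stands in for the NULL the SQL CASE returns here

def rows_to_remove(rows):
    """RowIDs that are deeper than the shallowest row of their duplicate group."""
    groups = defaultdict(list)
    for row_id, _category_id, category, name, l1, l2, l3, l4 in rows:
        groups[(category, name, l1)].append((fill_depth(l2, l3, l4), row_id))
    remove = []
    for members in groups.values():
        shallowest = min(depth for depth, _ in members)  # this is dense_rank = 1
        remove.extend(rid for depth, rid in members if depth > shallowest)
    return sorted(remove)

category_rows = [
    (1, 111, "Furniture", "Table", 200, 111, None, None),
    (2, 234, "Furniture", "Table", 200, 234, 123, None),
    (3, 122, "Furniture", "Chair", 200, 122, None, None),
    (4, 122, "Furniture", "Chair", 200, 122, 32, None),
    (5, 12, "Auto", "Trucks", 300, 766, 12, None),
    (6, 3434, "Auto", "Trucks", 300, 322, 3434, 333),
    (7, 332, "Auto", "Sport Vehicles", 300, 332, None, None),
    (8, 332, "Auto", "Sport Vehicles", 300, 332, None, None),
    (9, 300, "Auto", "Sedans", 300, 231, None, None),
    (10, 300, "Auto", "Sedans", 300, None, None, None),
    (11, 300, "Auto", "Cabriolet", 300, 456, 688, None),
    (12, 300, "Auto", "Cabriolet", 300, 456, 976, None),
    (13, 300, "Auto", "Motorcycles", 300, 456, 235, 334),
    (14, 300, "Auto", "Motorcycles", 300, 456, 235, 334),
]

print(rows_to_remove(category_rows))
```

Groups whose rows all share the same depth (Sport Vehicles, Cabriolet, Motorcycles) contribute nothing, which matches the requirement that equally detailed duplicates are left alone.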
This thread helped me a lot in understanding a different method of removing duplicate data. I want to thank the original contributors. I did, however, notice that the final solution is incomplete. The original poster wanted the results to return RowIDs 2, 4, 6, 9; however, the ToInclude != 1 filter doesn't allow that. I am adding the code to complete the query with a > 1 filter in the WHERE clause, which will produce the intended result. See the code below:
;with DataToSelect
as
(
SELECT *,
DENSE_RANK() OVER(PARTITION BY [ProductCategory], [ProductName], [SubCategory Level 1 ID]
ORDER BY
CASE
WHEN [SubCategory Level 4 ID] IS NOT NULL THEN 3
WHEN [SubCategory Level 3 ID] IS NOT NULL THEN 2
WHEN [SubCategory Level 2 ID] IS NOT NULL THEN 1
END) as [ToInclude]
FROM #Category
)
SELECT *
FROM
DataToSelect
WHERE
ToInclude > 1
ORDER BY
RowID
This returns:
Results Table of Code

How could I replace a T-SQL cursor?

I would like to ask how I could replace a cursor that I've put into my stored procedure.
So far, a cursor is the only way we have found to handle my scenario, but from what I've read this is not best practice.
This is my scenario: I have to calculate the stock recursively, row by row, and set the season according to what has been calculated in the previous rows.
I can set the season when the transfer type is "Purchase". The other transfers should be assigned the correct season by a T-SQL query.
The table where I should calculate the season has the following structure. The data is fake, but it reflects the real situation:
Transfer Table Example
The rows that have "FlagSeason" set to null are calculated as follows: in ascending order, the cursor starts from row 3, looks back over the previous rows, calculates the amount of stock for each season, and then updates the season column with the minimum season that still has stock.
Here's the code I used:
CREATE TABLE [dbo].[transfers]
(
[rowId] [int] NULL,
[area] [int] NULL,
[store] [int] NULL,
[item] [int] NULL,
[date] [date] NULL,
[type] [nvarchar](50) NULL,
[qty] [int] NULL,
[season] [nvarchar](50) NULL,
[FlagSeason] [int] NULL
) ON [PRIMARY]
INSERT INTO [dbo].[transfers]
([rowId]
,[area]
,[store]
,[item]
,[date]
,[type]
,[qty]
,[season]
,[FlagSeason])
VALUES (1,1,20,300,'2015-01-01','Purchase',3,'2015-FallWinter',1)
, (2,1,20,300,'2015-01-01','Purchase',4,'2016-SpringSummer',1)
, (3,1,20,300,'2015-01-01','Sales',-1,null,null)
, (4,1,20,300,'2015-01-01','Sales',-2,null,null)
, (5,1,20,300,'2015-01-01','Sales',-1,null,null)
, (6,1,20,300,'2015-01-01','Sales',-1,null,null)
, (7,1,20,300,'2015-01-01','Purchase',4,'2016-FallWinter',1)
, (8,1,20,300,'2015-01-01','Sales',-1,null,null)
DECLARE @RowId as int
DECLARE db_cursor CURSOR FOR
Select RowID
from Transfers
where [FlagSeason] is null
order by RowID
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @RowId
WHILE @@FETCH_STATUS = 0
BEGIN
Update Transfers
set Season = (Select min (Season) as Season
from (
Select
Season
, SUM(QTY) as Qty
from Transfers
where RowID < @RowId
and [FlagSeason] = 1
group by Season
having Sum(QTY) > 0
)S
where s.QTY >= 0
)
, [FlagSeason] = 1
where rowId = @RowId
FETCH NEXT FROM db_cursor INTO @RowId
end
-- release the cursor once the loop has finished
CLOSE db_cursor
DEALLOCATE db_cursor
In this case the query would extract:
3 qty for season 2015 FW
4 for 2016 SS.
Then the update statement will set 2015-FW (the minimum of the two seasons with qty).
Then the cursor moves on to row 4 and runs the query again to extract the stock, updated to take the calculation at row 3 into account. So the result should be
QTY 2 for 2015 FW
QTY 4 for 2016 SS
and then the update would set 2015 FW.
And so on.
The final output should be something like this:
Output
So far, the only way out was to implement a cursor, and it now takes over 30-40 minutes to scan and update about 2.5 million rows. Does anybody know a solution that does not resort to a cursor?
Thanks in advance!
Updated to run on 2008
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-SpringSummer', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 20, 300, '20170125', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170225', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20150801', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 21, 301, '20150901', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20151221', 'Sales', -2, NULL, NULL),
(1, 21, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 21, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 21, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 21, 302, '20151221', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 20, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 20, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20151221', 'Sales', -1, NULL, NULL);
WITH Purchases
AS (SELECT t1.RowID,
t1.Area,
t1.Store,
t1.Item,
t1.Date,
t1.Type,
t1.Qty,
t1.Season,
RunningInventory = ( SELECT SUM(t2.Qty)
FROM #transfer AS t2
WHERE t1.Type = t2.Type
AND t1.Area = t2.Area
AND t1.Store = t2.Store
AND t1.Item = t2.Item
AND t2.Date <= t1.Date
)
FROM #transfer AS t1
WHERE t1.Type = 'Purchase'
),
Sales
AS (SELECT t1.RowID,
t1.Area,
t1.Store,
t1.Item,
t1.Date,
t1.Type,
t1.Qty,
t1.Season,
RunningSales = ( SELECT SUM(ABS(t2.Qty))
FROM #transfer AS t2
WHERE t1.Type = t2.Type
AND t1.Area = t2.Area
AND t1.Store = t2.Store
AND t1.Item = t2.Item
AND t2.Date <= t1.Date
)
FROM #transfer AS t1
WHERE t1.Type = 'Sales'
)
SELECT Sales.RowID,
Sales.Area,
Sales.Store,
Sales.Item,
Sales.Date,
Sales.Type,
Sales.Qty,
Season = ( SELECT TOP 1
Purchases.Season
FROM Purchases
WHERE Purchases.Area = Sales.Area
AND Purchases.Store = Sales.Store
AND Purchases.Item = Sales.Item
AND Purchases.RunningInventory >= Sales.RunningSales
ORDER BY Purchases.Date, Purchases.Season
)
FROM Sales
UNION ALL
SELECT Purchases.RowID ,
Purchases.Area ,
Purchases.Store ,
Purchases.Item ,
Purchases.Date ,
Purchases.Type ,
Purchases.Qty ,
Purchases.Season
FROM Purchases
ORDER BY Sales.Area, Sales.Store, item, Sales.Date
Original answer below:
I don't understand the purpose of the flagseason column so I didn't include that. Essentially, this calculates a running sum for purchases and sales and then finds the season that has a purchase_to_date inventory of at least the sales_to_date outflow for each sales transaction.
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-FallWinter', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2016-FallWinter', 1),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL);
WITH Inventory
AS (SELECT *,
PurchaseToDate = SUM(CASE WHEN Type = 'Purchase' THEN Qty ELSE 0 END) OVER (ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
SalesToDate = ABS(SUM(CASE WHEN Type = 'Sales' THEN Qty ELSE 0 END) OVER (ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM #transfer
)
SELECT Inventory.RowID,
Inventory.Area,
Inventory.Store,
Inventory.Item,
Inventory.Date,
Inventory.Type,
Inventory.Qty,
Season = CASE
WHEN Inventory.Season IS NULL
THEN ( SELECT TOP 1
PurchaseToSales.Season
FROM Inventory AS PurchaseToSales
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
ORDER BY PurchaseToSales.Date -- order by the candidate row, not the outer row
)
ELSE
Inventory.Season
END,
Inventory.PurchaseToDate,
Inventory.SalesToDate
FROM Inventory;
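The running-total idea described in this answer can be sketched in a few lines of Python. This is an illustration of the allocation logic only, not a translation of the query: `assign_seasons` is a made-up name, it handles a single store/item at a time, it assumes purchase quantities are positive, and it considers every purchase in the list regardless of date. The rows are the sample data from the original answer.

```python
from bisect import bisect_left
from itertools import accumulate

def assign_seasons(transfers):
    """transfers: (date, type, qty, season) tuples for ONE store/item, in date order."""
    purchases = [t for t in transfers if t[1] == "Purchase"]
    # cumulative purchased quantity after each purchase (increasing, so bisect is valid)
    purchased_to_date = list(accumulate(qty for _, _, qty, _ in purchases))
    sold_to_date = 0
    result = []
    for when, kind, qty, season in transfers:
        if kind == "Sales":
            sold_to_date += abs(qty)
            # first purchase whose running total covers everything sold so far
            i = bisect_left(purchased_to_date, sold_to_date)
            season = purchases[i][3] if i < len(purchases) else None
        result.append((when, kind, qty, season))
    return result

sample = [
    ("20150101", "Purchase", 3, "2015-FallWinter"),
    ("20150601", "Purchase", 4, "2016-SpringSummer"),
    ("20150701", "Sales", -1, None),
    ("20150721", "Sales", -2, None),
    ("20150901", "Sales", -1, None),
    ("20160101", "Sales", -1, None),
    ("20170101", "Purchase", 4, "2016-FallWinter"),
    ("20170201", "Sales", -1, None),
]

for row in assign_seasons(sample):
    print(row)
```

The first three units sold are charged to 2015-FallWinter and the next three to 2016-SpringSummer. Because the lookup is a binary search over a precomputed running total, each sale is resolved without rescanning earlier rows, which is the property that makes the set-based approach attractive compared with the cursor.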
UPDATED
You'll need an index on your data to help with the sorting in order to make this perform.
Possibly:
CREATE NONCLUSTERED INDEX IX_Transfer ON #transfer(Store, Item, Date) INCLUDE(Area,Qty,Season,Type)
You should see an index scan on the named index. It will not be a seek because the sample query does not filter any data and all of the data is included.
In addition, you need to remove Season from the Partition By clause of the SalesToDate. Resetting the sales for each season will throw your comparisons off because the rolling sales need to be compared to the rolling inventory in order for you to determine the source of sales inventory.
Two other tips for the partition clause:
Don't duplicate the fields between partition by and order by. The order of the partition fields doesn't matter since the aggregate is reset for each partition. At best, the ordered partition field will be ignored; at worst, it may cause the optimizer to aggregate the fields in a particular order. This does not have any effect on the results, but can add unnecessary overhead.
Make sure your index matches the definition of the partition by/order by clause.
The index should be [partitioning fields, sequence doesn't matter] + [ordering fields, sequence needs to match order by clause].
In your scenario, the indexed columns should be on store, item, and then date. If date were before store or item, the index would not be used because the optimizer will need to first handle partitioning by store & item before sorting by date.
If you may have multiple areas in your data, the index and partition clauses would need to be
index: area, store, item, date
partition by: area, store, item order by date
Referring to Wes's answer, the proposed solution is almost fine. It works well, but I've noticed that the assignment of the season doesn't work properly because, in my scenario, the stock should be calculated and updated per store and item. I've updated the script with some adjustments. Moreover, I've added some new "fake" data to better illustrate my scenario and how it should work.
IF OBJECT_ID('tempdb..#transfer') IS NOT NULL
DROP TABLE #transfer;
GO
CREATE TABLE #transfer (
RowID INT IDENTITY(1, 1) PRIMARY KEY NOT NULL,
Area INT,
Store INT,
Item INT,
Date DATE,
Type VARCHAR(50),
Qty INT,
Season VARCHAR(50),
FlagSeason INT
);
INSERT INTO #transfer ( Area,
Store,
Item,
Date,
Type,
Qty,
Season,
FlagSeason
)
VALUES (1, 20, 300, '20150101', 'Purchase', 3, '2015-SpringSummer', 1),
(1, 20, 300, '20150601', 'Purchase', 4, '2016-SpringSummer', 1),
(1, 20, 300, '20150701', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20150721', 'Sales', -2, NULL, NULL),
(1, 20, 300, '20150901', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20160101', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170101', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 20, 300, '20170125', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170201', 'Sales', -1, NULL, NULL),
(1, 20, 300, '20170225', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20150801', 'Purchase', 4, '2017-SpringSummer', 1),
(1, 21, 301, '20150901', 'Sales', -1, NULL, NULL),
(1, 21, 301, '20151221', 'Sales', -2, NULL, NULL),
(1, 21, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 21, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 21, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 21, 302, '20151221', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20150801', 'Purchase', 1, '2016-SpringSummer', 1),
(1, 20, 302, '20150901', 'Purchase', 1, '2017-SpringSummer', 1),
(1, 20, 302, '20151101', 'Sales', -1, NULL, NULL),
(1, 20, 302, '20151221', 'Sales', -1, NULL, NULL)
;
WITH Inventory
AS (SELECT *,
PurchaseToDate = SUM(CASE WHEN Type = 'Purchase' THEN Qty ELSE 0 END) OVER (partition by store, item ORDER BY store, item,Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
SalesToDate = ABS(SUM(CASE WHEN Type = 'Sales' THEN Qty ELSE 0 END) OVER (partition by store, item,season ORDER BY store, item, Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW))
FROM #transfer
)
SELECT Inventory.RowID,
Inventory.Area,
Inventory.Store,
Inventory.Item,
Inventory.Date,
Inventory.Type,
Inventory.Qty,
Season = CASE
WHEN Inventory.Season IS NULL
THEN ( SELECT TOP 1
PurchaseToSales.Season
FROM Inventory AS PurchaseToSales
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
and PurchaseToSales.Item = inventory.item --//Added
and PurchaseToSales.store = inventory.store --//Added
and PurchaseToSales.Area = Inventory.area --//Added
ORDER BY PurchaseToSales.Date -- order by the candidate row, not the outer row
)
ELSE
Inventory.Season
END,
Inventory.PurchaseToDate,
Inventory.SalesToDate
FROM Inventory
Here is the output:
[Image: query output]
After these adjustments it works fine, but if I switch from the fake data to the real data, which sits in a 6-million-row table, the query becomes very slow (~400 rows extracted per minute) because of the checks added to the WHERE clause of the subquery:
WHERE PurchaseToSales.PurchaseToDate >= Inventory.SalesToDate
and PurchaseToSales.Item = inventory.item --//Added
and PurchaseToSales.store = inventory.store --//Added
and PurchaseToSales.Area = Inventory.area --//Added
I've tried replacing the subquery with CROSS APPLY, but nothing has changed. Am I missing something?
Thanks in advance

Add time between 2 dates across multiple rows in SQL Server

I have a table that lists all users for my company. There are multiple entries for each staff member showing how they have been employed.
RowID UserID FirstName LastName Title StartDate Active EndDate
-----------------------------------------------------------------------------------
1 1 John Smith Manager 2017-01-01 0 2017-01-31
2 1 John Smith Director 2017-02-01 0 2017-02-28
3 1 John Smith CEO 2017-03-01 1 NULL
4 2 Sam Davey Manager 2017-01-01 0 2017-02-28
5 2 Sam Davey Manager 2017-03-01 0 NULL
6 3 Hugh Holland Admin 2017-02-01 1 NULL
7 4 David Smith Admin 2017-01-01 0 2017-02-28
I am trying to write a query that will tell me someone's length of service at any given time.
The part I am having trouble with is that a single person is represented by multiple rows as their information changes over time, so I need to combine multiple rows...
I have a query that reports who is employed at a point in time, which is as far as I have gotten.
DECLARE @DateCheck datetime
SET @DateCheck = '2017/05/10'
SELECT *
FROM UsersTest
WHERE @DateCheck >= StartDate AND @DateCheck <= ISNULL(EndDate, @DateCheck)
You need to use the DATEDIFF function. The key will be choosing the appropriate datepart: days, months, or years. The return value is an integer count of datepart boundaries crossed, so if you choose years the result will be coarse (and remember, that applies to each record, not to the summary). I've chosen months below. The CTE has been added to get the most recent name for each user:
WITH CurrentName AS
(SELECT UserID, FirstName, LastName
from
UserStartStop
where Active = 1 -- You can replace this with a date check
)
SELECT uss.UserID,
MAX(cn.FirstName) as FirstName, -- the max is necessary because we are
-- grouping. Could include in group by
MAX(cn.LastName) as LastName,
SUM(DATEDIFF(mm,uss.StartDate,COALESCE(uss.EndDate,GETDATE()))) as MonthsOfService
from UserStartStop uss
JOIN CurrentName cn
on uss.UserID = cn.UserID
GROUP BY uss.UserID -- qualified: UserID exists in both uss and cn
order by uss.UserID
For months in service, change 'd' to 'mm':
Create table #UsersTest (
RowId int
, UserID int
, FirstName nvarchar(100)
, LastName nvarchar(100)
, Title nvarchar(100)
, StartDate date
, Active bit
, EndDate date)
Insert #UsersTest values (1, 1, 'John', 'Smith', 'Manager', '2017-01-01', 0, '2017-01-31')
Insert #UsersTest values (1, 1, 'John', 'Smith', 'Director', '2017-02-01', 0, '2017-02-28')
Insert #UsersTest values (1, 1, 'John', 'Smith', 'CEO', '2017-03-01', 1, null)
Insert #UsersTest values (1, 2, 'Sam', 'Davey', 'Manager', '2017-01-01', 0, '2017-02-28')
Insert #UsersTest values (1, 2, 'Sam', 'Davey', 'Manager', '2017-03-01', 0, null)
Insert #UsersTest values (1, 3, 'Hugh', 'Holland', 'Admin', '2017-02-01', 1, null)
Insert #UsersTest values (1, 4, 'David', 'Smith', 'Admin', '2017-01-01', 0, '2017-02-28')
Declare @DateCheck as datetime = '2017/05/10'
Select UserID, FirstName, LastName
, Datediff(d, Min([StartDate]), iif(isnull(Max([EndDate]),'1900-01-01')<@DateCheck, @DateCheck ,Max([Enddate]))) as [LengthOfService]
from #UsersTest
Group by UserID, FirstName, LastName
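For comparison, here is a Python sketch of a per-row sum, where each employment row contributes its own span and an open or later end date is cut off at the check date. It is one possible reading of "length of service at a given time", not the query above: `length_of_service` is a hypothetical name, and the rows are the (UserID, StartDate, EndDate) columns of the question's sample table.

```python
from datetime import date

def length_of_service(rows, as_of):
    """Days employed per user up to as_of, summed across that user's rows."""
    totals = {}
    for user_id, start, end in rows:
        if start > as_of:
            continue  # this period had not begun yet at the check date
        # an open-ended or later end date is capped at the check date
        effective_end = as_of if end is None or end > as_of else end
        totals[user_id] = totals.get(user_id, 0) + (effective_end - start).days
    return totals

employment = [
    (1, date(2017, 1, 1), date(2017, 1, 31)),
    (1, date(2017, 2, 1), date(2017, 2, 28)),
    (1, date(2017, 3, 1), None),
    (2, date(2017, 1, 1), date(2017, 2, 28)),
    (2, date(2017, 3, 1), None),
    (3, date(2017, 2, 1), None),
    (4, date(2017, 1, 1), date(2017, 2, 28)),
]

print(length_of_service(employment, date(2017, 5, 10)))
```

Summing row by row leaves out the day between one row's EndDate and the next row's StartDate, so user 1 comes out at 127 days rather than the 129 days between the earliest start and the check date. Which figure is wanted depends on whether those boundary days count as service.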
Try this:
Select
FirstName,
LastName,
Min(StartDate) as StartDate,
Max(isnull(EndDate,getdate())) as EndDate
from UsersTest
group by FirstName, LastName

SQL Server conditional subtotal query

Given the following table:
create table #T
(
user_id int,
project_id int,
datum datetime,
status varchar(10),
KM int
)
insert into #T values
(1, 1, '20160301 10:25', 'START', 1000),
(1, 1, '20160301 10:28', 'PASS', 1008),
(2, 2, '20160301 10:29', 'START', 2000),
(1, 1, '20160301 11:08', 'STOP', 1045),
(3, 3, '20160301 10:25', 'START', 3000),
(2, 2, '20160301 10:56', 'STOP', 2020),
(1, 4, '20160301 15:00', 'START', 1045),
(4, 5, '20160301 15:10', 'START', 400),
(1, 4, '20160301 15:10', 'PASS', 1060),
(1, 4, '20160301 15:20', 'PASS', 1080),
(1, 4, '20160301 15:30', 'STOP', 1080),
(4, 5, '20160301 15:40', 'STOP', 450),
(3, 3, '20160301 16:25', 'STOP', 3200)
I have to sum the length of a track between START and STOP statuses for a given user and project
The expected result would be this:
user_id project_id datum TOTAL_KM
----------- ----------- ---------- -----------
1 1 2016-03-01 45
1 4 2016-03-01 35
2 2 2016-03-01 20
3 3 2016-03-01 200
4 5 2016-03-01 50
How can I achieve this without using a cursor?
The performance is an issue (I have over 1 million records per month and we have to keep data for several years)
Explanation:
We can ignore the records with the status "PASS". Basically we have to subtract the KM value of the START record from the STOP record for a given user and project.
There can be several hundred records between a START and a STOP (as shown in the sample data)
The date should be the date of START (in case where we have an over midnight delivery)
I think I should have a SELECT with an OVER() clause but I don't know how to formulate my query to respect those conditions.
Any idea?
SELECT t.[user_id],
t.project_id,
cast(t.datum as date) as datum,
t1.KM- t.KM as KM
FROM #T t
INNER JOIN #T t1
ON t.[user_id]=t1.[user_id] and t.project_id = t1.project_id
WHERE t.[status] = 'START' and t1.[status] = 'STOP'
ORDER BY t.[user_id],
t.project_id,
cast(t.datum as date)
Output:
user_id project_id datum KM
----------- ----------- ---------- -----------
1 1 2016-03-01 45
1 4 2016-03-01 35
2 2 2016-03-01 20
3 3 2016-03-01 200
4 5 2016-03-01 50
(5 row(s) affected)
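The pairing that the self join performs can be illustrated with a short Python sketch. It assumes, as the sample data suggests, that each (user_id, project_id) has exactly one START and one STOP; `km_per_trip` is a made-up name and the records are the rows inserted into #T above.

```python
def km_per_trip(records):
    """Map (user_id, project_id) -> (start date, STOP km minus START km)."""
    starts, stops = {}, {}
    for user_id, project_id, datum, status, km in records:
        key = (user_id, project_id)
        if status == "START":
            starts[key] = (datum, km)
        elif status == "STOP":
            stops[key] = km
        # PASS records are ignored, as in the question
    return {key: (start_datum[:8], stops[key] - start_km)  # first 8 chars = yyyymmdd of START
            for key, (start_datum, start_km) in sorted(starts.items())
            if key in stops}

records = [
    (1, 1, "20160301 10:25", "START", 1000),
    (1, 1, "20160301 10:28", "PASS", 1008),
    (2, 2, "20160301 10:29", "START", 2000),
    (1, 1, "20160301 11:08", "STOP", 1045),
    (3, 3, "20160301 10:25", "START", 3000),
    (2, 2, "20160301 10:56", "STOP", 2020),
    (1, 4, "20160301 15:00", "START", 1045),
    (4, 5, "20160301 15:10", "START", 400),
    (1, 4, "20160301 15:10", "PASS", 1060),
    (1, 4, "20160301 15:20", "PASS", 1080),
    (1, 4, "20160301 15:30", "STOP", 1080),
    (4, 5, "20160301 15:40", "STOP", 450),
    (3, 3, "20160301 16:25", "STOP", 3200),
]

for key, value in km_per_trip(records).items():
    print(key, value)
```

If a user can start and stop the same project more than once, both this sketch and the plain self join would need an extra rule for matching each STOP to its own START.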
This could be achieved with a simple self join.
An example (this may not be the exact query, just the idea):
Select
a.user_id,
a.project_id,
b.datum as StartDate,
a.KM-b.KM as TotalKM
From #T a
Join
(
Select user_id, project_id, datum, KM From #T Where
status = 'START'
) b ON b.user_id = a.user_id AND b.project_id = a.project_id
Where a.status = 'STOP'

Resources