SQL Server CTE Use Previous Computed Date as Next Start Date - sql-server

I have a table that holds tasks. Each task has an allotted number of hours that it's supposed to take to complete the task.
I'm storing the data in a table, like so:
declare @fromtable table (recordid int identity(1,1), orderdate date, deptid int, task varchar(500), estimatedhours int);
I also have a function that calculates the completion date of the task, based on the start date, estimated hours, and department, and some other math that computes headcount, hours available to work, etc.
dbo.fn_getCapEndDate(aStartDate,estimatedHours,deptID)
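(For anyone reproducing the setup: the real function isn't shown here, so a hypothetical stub is needed. A minimal sketch, assuming 8 workable hours per day and ignoring the department/capacity logic the real function applies:)

-- Hypothetical stub of dbo.fn_getCapEndDate, for testing only; the real
-- function also factors in headcount, hours available to work, etc.
CREATE FUNCTION dbo.fn_getCapEndDate (@aStartDate date, @estimatedHours int, @deptID int)
RETURNS date
AS
BEGIN
    -- crude assumption: 8 workable hours per day, rounded up; @deptID is ignored in this stub
    RETURN DATEADD(DAY, CAST(CEILING(@estimatedHours / 8.0) AS int), @aStartDate);
END;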
I need to generate the start and end date for each record in #fromtable. The first record will start with column orderdate as the start date for the computation, then each subsequent record will use the previous record's computedEndDate as their start date.
Here's what I have started with:
with MyCTE as
(
select mt.recordID, mt.deptID, mt.estimatedhours, ROW_NUMBER() over (order by recordID) as RowNum,
convert(date,mt.orderdate) as computedStart,
case when mt.recordID = 1 then convert(date,dbo.fn_getCapEndDate(mt.orderdate,mt.estimatedhours,mt.deptid)) end as computedEnd
from #fromtable mt
)
select c1.*, c2.recordID,
case when c2.recordid is null then c1.computedStart else c2.computedEnd end as StartDate,
case when c2.recordid is null then c1.computedEnd else dbo.fn_getCapEndDate(c2.computedEnd,c1.estimatedhours,c1.deptid) end as computedEnd
from MyCTE c1
left join MyCTE c2 on c1.RowNum = c2.RowNum + 1;
With this, the first two rows have the correct start/end dates. Every row after that computes NULL for its start and end values. It "loses" the value of the previous row's computed end date.
What can I do to fix the issue and return the values as needed?
EDIT: Sample data in text format:
estimatedhours OrderDate
0 1/1/2017
0 1/1/2017
0 1/1/2017
0 1/1/2017
500 1/1/2017
32 1/1/2017
0 1/1/2017
0 1/1/2017
320 1/1/2017
0 1/1/2017
5 1/1/2017
0 1/1/2017
4 1/1/2017

You can use LEAD as below:
select RecordId, EstimatedHours, StartDate,
ComputedEnd = LEAD(StartDate) over (order by RecordId)
From yourTable
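Note that LEAD only helps once every row's start date is already known. If each row's start date really has to be the previous row's output of dbo.fn_getCapEndDate, as the question describes, a recursive CTE that walks the rows in order is a common alternative. A minimal sketch under that assumption (not a tested solution):

with Ordered as
(
    select recordid, deptid, estimatedhours, orderdate,
           ROW_NUMBER() over (order by recordid) as RowNum
    from @fromtable
),
Chained as
(
    -- anchor row: starts on its own order date
    select RowNum, recordid, deptid, estimatedhours,
           convert(date, orderdate) as StartDate,
           convert(date, dbo.fn_getCapEndDate(orderdate, estimatedhours, deptid)) as ComputedEnd
    from Ordered
    where RowNum = 1
    union all
    -- each following row starts where the previous one ended
    select o.RowNum, o.recordid, o.deptid, o.estimatedhours,
           c.ComputedEnd,
           convert(date, dbo.fn_getCapEndDate(c.ComputedEnd, o.estimatedhours, o.deptid))
    from Ordered o
    join Chained c on o.RowNum = c.RowNum + 1
)
select recordid, estimatedhours, StartDate, ComputedEnd
from Chained
option (maxrecursion 0);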

Related

Select a grouped table (by Id) filtered by a datetime column according to NULL and MAX(date) values

Imagine that I have a table with quite a few columns, but it has to be returned filtered just by Id and EndDate.
Id   EndDate            ...
1    NULL
1    01.01.2022 15:25
1    01.01.2022 15:24
2    15.01.2022 10:00
2    15.01.2022 11:00
2    17.01.2022 00:00
3    NULL
3    10.10.2022 22:12
4    18.05.2022 17:15
4    18.05.2022 17:17
4    19.05.2022 00:00
The resulting table must be the following:
Id   EndDate            ...
1    NULL
2    17.01.2022 00:00
3    NULL
4    19.05.2022 00:00
The record for a specific Id must be picked either when it has a NULL EndDate value, or otherwise as the one with the MAX value. As seen in the resulting table, the record with Id = 1 has a NULL EndDate so it must be picked; the records with Id = 4 have no NULL EndDate, so the one with MAX(EndDate) must be returned.
I have tried different scenarios with joins and UNIONs, but it seems hopeless. I also considered CTEs, but they seem irrelevant here. The point is also to get an optimal solution, because the resulting table is meant to be joined to another table.
Any idea of how to get the desired result would be appreciated.
You can use ROW_NUMBER in a common table expression to define the priority. Just replace the NULL with a date far in the future, like 9999-12-31, and then you can simply order by the date.
WITH cte
AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ISNULL(EndDate,'99991231') DESC) AS RN
FROM dbo.myTable
)
SELECT *
FROM cte
WHERE cte.RN = 1;
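Since the question mentions that the result is meant to be joined to another table, here is a minimal sketch of how the filtered CTE could feed such a join (OtherTable and its Id key are hypothetical names, not from the question):

WITH cte
AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY Id ORDER BY ISNULL(EndDate,'99991231') DESC) AS RN
FROM dbo.myTable
)
SELECT o.*, c.EndDate
FROM cte AS c
INNER JOIN dbo.OtherTable AS o -- hypothetical table being joined
        ON o.Id = c.Id
WHERE c.RN = 1;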
With simple aggregation and a CASE expression where you check if there are any null dates for each Id:
SELECT Id,
CASE WHEN COUNT(*) = COUNT(EndDate) THEN MAX(EndDate) END AS EndDate
FROM tablename
GROUP BY Id;
The condition COUNT(*) = COUNT(EndDate) is satisfied only if all dates are not null.

Getting the Min(startdate) and Max(enddate) for an ID when that ID shows up multiple times

I have a table with columns for ID, StartDate, EndDate, and whether or not there was a gap between the end date of that row and the next start date. If there were only one instance of that ID, I know that I could just do
SELECT min(startdate),max(enddate)
FROM table
GROUP BY ID
However, I have multiple instances of these IDs across several non-connected timespans, so doing that would give me the very first start date and the very last end date across different blocks of time for that person ID. How would I go about making sure I get the min and max dates for each specific block of time?
I thought about creating a new column holding a number for each block of time: the first gap-free block would get 1, and whenever a row has a gap the number would increase by 1 to mark a new block. But I am not really sure how to go about that. Here is some sample data to illustrate what I am working with:
ID StartDate EndDate NextDate Gap_ind
001 1/1/2018 1/31/2018 2/1/2018 N
001 2/1/2018 2/28/2018 3/1/2018 N
001 3/1/2018 3/31/2018 5/1/2018 Y
001 5/1/2018 5/31/2018 6/1/2018 N
001 6/1/2018 6/30/2018 6/30/2018 N
This is a classic "gaps and islands" problem, where you are trying to define the boundaries of your islands, and which you can solve by using some windowing functions.
Your initial effort is on track. Rather than getting the next start date, though, I used the previous end date to calculate the groupings.
The innermost subquery below gets the previous end date for each of your date ranges, and also assigns a row number that we use later to keep our groupings in order.
The next subquery out uses the previous end date to identify which groups of date ranges go together (overlap, or nearly so).
The outermost query is the end result you're looking for.
SELECT
Grp.ID,
MIN(Grp.StartDate) AS GroupingStartDate,
MAX(Grp.EndDate) AS GroupingEndDate
FROM
(
SELECT
PrevDt.ID,
PrevDt.StartDate,
PrevDt.EndDate,
SUM(CASE WHEN DATEADD(DAY,1,PrevDt.PreviousEndDate) >= PrevDt.StartDate THEN 0 ELSE 1 END)
OVER (PARTITION BY PrevDt.ID ORDER BY PrevDt.RN) AS GrpNum
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate, EndDate) as RN,
ID,
StartDate,
EndDate,
LAG(EndDate,1) OVER (PARTITION BY ID ORDER BY StartDate) AS PreviousEndDate
FROM
tbl
) AS PrevDt
) AS Grp
GROUP BY
Grp.ID,
Grp.GrpNum;
Results:
+-----+-------------------+-----------------+
| ID  | GroupingStartDate | GroupingEndDate |
+-----+-------------------+-----------------+
| 001 | 2018-01-01        | 2018-03-31      |
| 001 | 2018-05-01        | 2018-06-30      |
+-----+-------------------+-----------------+
Further reading:
The SQL of Gaps and Islands in Sequences
Gaps and Islands Across Date Ranges
This is an example of a gaps-and-islands problem. A simple solution is to use lag() to determine if there are overlaps. When there is none, you have the start of a group. A cumulative sum defines the group -- and you aggregate on that.
select t.id, min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= dateadd(day, -1, startdate)
then 0 else 1
end) over (partition by id order by startdate) as grp
from (select t.*, lag(enddate) over (partition by id order by startdate) as prev_enddate
from t
) t
) t
group by id, grp;

Efficient way to forward-fill nulls in time-series data using T-SQL

I have a table with time-series data that's mostly nulls, and I want to fill in all of the nulls with the last known value.
I have a few solutions, but they're much slower than doing the equivalent DataFrame.fillna(method='ffill') operation in Pandas.
A simplified version of the code / data that I'm using:
select d.[date], d.[price],
(select top 1 p.price from price_table p
where p.price is not null and p.[date] <= d.[date]
order by p.[date] desc) as ff_price
from price_table d
To produce the table
date price ff_price
---------- ----- --------
2016-07-11 0.79 0.79
2016-07-12 NULL 0.79
2016-07-13 NULL 0.79
2016-07-14 0.69 0.69
2016-07-15 NULL 0.69
...
2016-09-21 0.88 0.88
...
I have >100 million rows, so this takes quite a while.
This looks like a classic gaps-and-islands question. Assuming you're not using SQL Server 2008 or earlier (which are all out of mainstream support), this should get you the result you're after:
WITH CTE AS(
SELECT [date],
price,
COUNT(CASE WHEN price IS NOT NULL THEN 1 END) OVER (ORDER BY [date]
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS Grp
FROM price_table p)
SELECT [date],
price,
MIN(price) OVER (PARTITION BY grp) AS ff_price
FROM CTE;
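To see why this works, here are the Grp values the CTE produces for the sample rows shown in the question: the running count of non-null prices only increases on rows that actually carry a price, so every NULL row falls into the group of the last priced row, and MIN(price) within that group is the forward-filled value.
date       price Grp ff_price
---------- ----- --- --------
2016-07-11 0.79  1   0.79
2016-07-12 NULL  1   0.79
2016-07-13 NULL  1   0.79
2016-07-14 0.69  2   0.69
2016-07-15 NULL  2   0.69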
Assuming that your date column is DATE and your price column is DECIMAL(5,2), please test this approach:
SELECT
P.[date],
P.[price],
ff_price = CONVERT(
DECIMAL(5,2), -- Original price datatype
SUBSTRING(
MAX(
CAST(P.[date] AS BINARY(3)) + -- 3: datalength of P.[date] column
CAST(P.[price] AS BINARY(5)) -- 5: datalength of P.[price] column
) OVER (ORDER BY P.[date] ROWS UNBOUNDED PRECEDING),
4, -- Position to start that's not the binary part of the date
5))-- Characters that compose the binary of the original price datatype
FROM
price_table AS P
This is a solution I implemented for a similar problem. The reason this approach is good is that it doesn't require an explicit sort, as long as you have an index on the date column.
What it does is basically use a windowed MAX over the concatenation of the 3 bytes that compose your date column (this is why I mentioned that your column must be DATE; a DATETIME needs 8 bytes, and you would have to edit the query to work with that) with the bytes that compose your price column (5 bytes, as assumed). This is the CAST(P.[date] AS BINARY(3)) + CAST(P.[price] AS BINARY(5)) part.
When you calculate this with ORDER BY P.[date] ROWS UNBOUNDED PRECEDING, the engine is basically doing a rolling MAX over values whose most significant bytes are your dates. The MAX result updates whenever the date changes, but since concatenating any value with a NULL price also yields NULL (as binary), the MAX ignores those rows and retains the previous non-null MAX (by P.[date] ROWS UNBOUNDED PRECEDING).
This is the binary result of the windowed MAX (I added a preceding record with a NULL price so you can see that the result is NULL for NULL price values):
date price ff_price WindowedMax
2016-07-10 NULL NULL NULL
2016-07-11 0.79 0.79 0x9B3B0B050200014F
2016-07-12 NULL 0.79 0x9B3B0B050200014F
2016-07-13 NULL 0.79 0x9B3B0B050200014F
2016-07-14 0.69 0.69 0x9E3B0B0502000145
2016-07-15 NULL 0.69 0x9E3B0B0502000145
2016-07-21 0.88 0.88 0xA53B0B0502000158
2016-07-22 NULL 0.88 0xA53B0B0502000158
You can also use APPLY:
SELECT t.*, t1.price AS ff_price
FROM price_table t OUTER APPLY
(SELECT TOP (1) t1.*
FROM price_table t1
WHERE t1.[date] <= t.[date] AND t1.price IS NOT NULL
ORDER BY t1.[date] DESC
) t1;

Populating a list of dates without a defined end date - SQL server

I have a list of accounts and their cost which changes every few days.
In this list I only have the start date every time the cost updates to a new one, but no column for the end date.
Meaning, I need to populate a list of dates, where the end date for a specific account and cost is deduced from the start date of the same account with a new cost.
More or less like that:
Account start date cost
one 1/1/2016 100$
two 1/1/2016 150$
one 4/1/2016 200$
two 3/1/2016 200$
And the result I need would be:
Account date cost
one 1/1/2016 100$
one 2/1/2016 100$
one 3/1/2016 100$
one 4/1/2016 200$
two 1/1/2016 150$
two 2/1/2016 150$
two 3/1/2016 200$
For example, if the cost changed in the middle of the month, then the sample data will only hold two records (one per each unique combination of account, start date, and cost), while the results will hold 30 records with the cost for each and every day of the month (15 for the first cost and 15 for the second one). The costs are a given, and there is no need to calculate them (they are inserted manually).
Note that the result contains more records because the sample data shows only a start date and an updated cost for that account as of that date, while the results show the cost for every day of the month.
Any ideas?
The solution is a bit long.
I added an extra date for test purposes:
DECLARE @t table(account varchar(10), startdate date, cost int)
INSERT @t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500) -- extra row
;WITH CTE as
( SELECT
row_number() over (partition by account order by startdate) rn,
*
FROM @t
),N(N)AS
(
SELECT 1 FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))M(N)
),
tally(N) AS -- tally is limited to 1000 days
(
SELECT ROW_NUMBER()OVER(ORDER BY N.N) - 1 FROM N,N a,N b
),GROUPED as
(
SELECT
cte.account, cte.startdate, cte.cost, cte2.cost cost2, cte2.startdate enddate
FROM CTE
JOIN CTE CTE2
ON CTE.account = CTE2.account
and CTE.rn = CTE2.rn - 1
)
-- used DISTINCT to avoid overlapping dates
SELECT DISTINCT
CASE WHEN datediff(d, startdate,enddate) = N THEN cost2 ELSE cost END cost,
dateadd(d, N, startdate) startdate,
account
FROM grouped
JOIN tally
ON datediff(d, startdate,enddate) >= N
Result:
cost startdate account
100 2016-01-01 one
100 2016-01-02 one
100 2016-01-03 one
150 2016-01-01 two
150 2016-01-02 two
200 2016-01-03 two
200 2016-01-04 one
200 2016-01-04 two
200 2016-01-05 two
500 2016-01-06 two
Thank you @t-clausen.dk!
It didn't solve the problem completely, but it did point me in the right direction.
Eventually I used the LEAD function to generate an end date for every cost per account, and then I was able to populate a list of dates based on that idea.
Here's how I generate the end dates:
DECLARE @t table(account varchar(10), startdate date, cost int)
INSERT @t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500)
select account
,[startdate]
,DATEADD(DAY, -1, LEAD([Startdate], 1,'2100-01-01') OVER (PARTITION BY account ORDER BY [Startdate] ASC)) AS enddate
,cost
from @t
It returned the expected result:
account startdate enddate cost
one 2016-01-01 2016-01-03 100
one 2016-01-04 2099-12-31 200
two 2016-01-01 2016-01-02 150
two 2016-01-03 2016-01-05 200
two 2016-01-06 2099-12-31 500
Please note that I set the end date of current costs to a date in the far future, which means (for me) that they are currently active.
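For completeness, a minimal sketch of the population step alluded to above (an assumption of how it could look, not necessarily the exact query used): join the generated start/end ranges to an ad-hoc numbers list and add N days to each start date. The open-ended ranges (end date 2099-12-31) would be capped in practice; the numbers list below covers at most 1000 days per range.

;with Ranges as
(
    select account, startdate, cost,
           DATEADD(DAY, -1, LEAD(startdate, 1, '2100-01-01') OVER (PARTITION BY account ORDER BY startdate)) as enddate
    from @t
),
Numbers(n) as
(
    -- ad-hoc list of 0..999; any row source with enough rows works
    select top (1000) ROW_NUMBER() over (order by (select null)) - 1
    from sys.all_objects
)
select r.account, DATEADD(DAY, n.n, r.startdate) as [date], r.cost
from Ranges r
join Numbers n on n.n <= DATEDIFF(DAY, r.startdate, r.enddate)
order by r.account, [date];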

How to derive 'N' date rows from a single record with From / To date columns?

The title sounds confusing, but let me explain:
I have a table that has two columns that provide a date range, and one column that provides a value. I need to query that table and "detail" the data so that there is one row per date in the range, each carrying the value.
Is it possible to do only using TSQL?
Additional Info
The table in question is about 2-3 million records long (and growing)
Assuming the range of dates is fairly narrow, an alternative is to use a recursive CTE to create a list of all dates in the range and then join to it:
WITH LastDay AS
(
SELECT MAX(Date_To) AS MaxDate
FROM MyTable
),
Days AS
(
SELECT MIN(Date_From) AS TheDate
FROM MyTable
UNION ALL
SELECT DATEADD(d, 1, TheDate) AS TheDate
FROM Days CROSS JOIN LastDay
WHERE TheDate <= LastDay.MaxDate
)
SELECT mt.Item_ID, mt.Cost_Of_Item, d.TheDate
FROM MyTable mt
INNER JOIN Days d
ON d.TheDate BETWEEN mt.Date_From AND mt.Date_To;
I've also assumed that Date_From and Date_To represent an inclusive range (i.e. both edges are included); note that it is unusual to use an inclusive BETWEEN on dates.
Edit
The default MAXRECURSION on a recursive CTE in SQL Server is 100, which limits the date range in the query above to a span of about 100 days. You can raise this to a maximum of 32767, or specify OPTION (MAXRECURSION 0) to remove the limit entirely, as the query below does.
Also, if you are filtering just a smaller range of dates in your large table, you can adjust the CTE to limit the number of days in the range:
WITH DateRange AS
(
SELECT CAST('2014-01-01' AS DATE) AS MinDate,
CAST('2014-02-16' AS DATE) AS MaxDate
),
Days AS
(
SELECT MinDate AS TheDate
FROM DateRange
UNION ALL
SELECT DATEADD(d, 1, TheDate) AS TheDate
FROM Days CROSS APPLY DateRange
WHERE TheDate <= DateRange.MaxDate
)
SELECT mt.Item_ID, mt.Cost_Of_Item, d.TheDate
FROM MyTable mt
INNER JOIN Days d
ON d.TheDate BETWEEN mt.Date_From AND mt.Date_To
OPTION (MAXRECURSION 0);
This can be achieved using Cursors.
I've simulated the test data provided, created another table named "DesiredTable" to store the data, and wrote the following cursor, which achieves exactly what you are looking for:
SET NOCOUNT ON;
DECLARE @ITEM_ID int, @COST_OF_ITEM Money,
@DATE_FROM date, @DATE_TO date;
DECLARE @DateDiff INT; -- holds number of days between from & to columns
DECLARE @counter INT = 0; -- for loop counter
PRINT '-------- Begin the Date Expanding Cursor --------';
-- defining the cursor target statement
DECLARE Date_Expanding_Cursor CURSOR FOR
SELECT [ITEM_ID]
,[COST_OF_ITEM]
,[DATE_FROM]
,[DATE_TO]
FROM [dbo].[OriginalTable]
-- opening the cursor
OPEN Date_Expanding_Cursor
-- fetching next row data into the declared variables
FETCH NEXT FROM Date_Expanding_Cursor
INTO @ITEM_ID, @COST_OF_ITEM, @DATE_FROM, @DATE_TO
-- if next row is found
WHILE @@FETCH_STATUS = 0
BEGIN
-- calculate the number of days in between the date columns
SELECT @DateDiff = DATEDIFF(day, @DATE_FROM, @DATE_TO)
-- reset the counter to 0 for the next loop
set @counter = 0;
WHILE @counter <= @DateDiff
BEGIN
-- inserting rows inside the new table
insert into DesiredTable
Values (@COST_OF_ITEM, DATEADD(day, @counter, @DATE_FROM))
set @counter = @counter + 1
END
-- fetching next row
FETCH NEXT FROM Date_Expanding_Cursor
INTO @ITEM_ID, @COST_OF_ITEM, @DATE_FROM, @DATE_TO
END
-- cleanup code
CLOSE Date_Expanding_Cursor;
DEALLOCATE Date_Expanding_Cursor;
The code fetches every row from your original table, calculates the number of days between the DATE_FROM and DATE_TO columns, and then, using this number, inserts one row per day (same cost, incremented date) into the new table DesiredTable.
Give it a try and let me know the results.
You can generate an increments table and join it to your dateFrom:
Query:
With inc(n) as (
Select ROW_NUMBER() over (order by (select 1)) -1 From (
Select 1 From (values(1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) as x1(n)
Cross Join (values(1), (1), (1), (1), (1), (1), (1), (1), (1), (1)) as x2(n)
) as x(n)
)
Select item_id, cost, DATEADD(day, n, dateFrom) as [Date], n From @dates d
Inner Join inc i on n <= DATEDIFF(day, dateFrom, dateTo)
Order by item_id
Output:
item_id cost Date n
1 100 2014-01-01 00:00:00.000 0
1 100 2014-01-02 00:00:00.000 1
1 100 2014-01-03 00:00:00.000 2
2 105 2014-01-08 00:00:00.000 2
2 105 2014-01-07 00:00:00.000 1
2 105 2014-01-06 00:00:00.000 0
2 105 2014-01-09 00:00:00.000 3
3 102 2014-02-14 00:00:00.000 3
3 102 2014-02-15 00:00:00.000 4
3 102 2014-02-16 00:00:00.000 5
3 102 2014-02-11 00:00:00.000 0
3 102 2014-02-12 00:00:00.000 1
3 102 2014-02-13 00:00:00.000 2
Sample Data:
declare @dates table(item_id int, cost int, dateFrom datetime, dateTo datetime);
insert into @dates(item_id, cost, dateFrom, dateTo) values
(1, 100, '20140101', '20140103')
, (2, 105, '20140106', '20140109')
, (3, 102, '20140211', '20140216');
Yet another way is to create and maintain a calendar table containing all dates for many years (in our app we have a table covering 30 years or so, extended every year). Then you can just join to the calendar:
select <whatever you need>, calendar.day
from <your tables> inner join calendar on calendar.day between <min date> and <max date>
This approach allows you to include additional information (holidays, etc.) in the calendar table, which is sometimes very helpful.
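A minimal sketch of how such a calendar table could be created, filled, and used (the table name, column name, and 30-year range here are assumptions):

CREATE TABLE dbo.calendar ([day] date NOT NULL PRIMARY KEY);

-- fill roughly 30 years of dates, 2000-01-01 through 2029-12-31
;WITH Numbers AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) - 1 AS n
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.calendar ([day])
SELECT DATEADD(DAY, n, CAST('2000-01-01' AS date))
FROM Numbers
WHERE n < 10958;

-- expand the date ranges from the question by joining to the calendar
SELECT mt.Item_ID, mt.Cost_Of_Item, c.[day]
FROM MyTable mt
INNER JOIN dbo.calendar c
        ON c.[day] BETWEEN mt.Date_From AND mt.Date_To;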
