Optimize insert of 1B+ rows - sql-server

I work in an organization that has over 75,000 employees. In our payroll system, each employee has 32 unique banks which store things like Sick Time, Vacation Time, Banked Overtime, etc.
Here are the existing tables
Employee
(
Employee_key INT IDENTITY(1,1)
Lastname,
Firstname
)
Employee Key | Lastname | Firstname
-----------------------------------
100 | Smith | John
Bank
(
Bank_key INT IDENTITY(1,1),
Bank_name VARCHAR(50)
)
Bank_key | Bank_name
---------------------
100 | VACATION
Employee_balance
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Bank_balance NUMERIC(10,5) -- Aggregate value of bank including future dated entries
)
Employee_key | Bank_Key | Bank_Balance
--------------------------------------
100 | 100 | 0
Employee_balance_trans
(
Employee_key INT, --FK to Employee
Bank_key INT, --FK to Bank
Trans_dt DATE -- transaction date that affects the bank
Bank_delta NUMERIC(10,5)
)
Employee_key | Bank_key | Trans_dt | Balance_delta
--------------------------------------------------
100 | 100 | 20230701 | -8.0
100 | 100 | 20230801 | -8.0
100 | 100 | 20230901 | -8.0
100 | 100 | 20231001 | -8.0
100 | 100 | 20231101 | -8.0
This employee has 5 vacation days booked into the future, for a total of 40 hours. As of January 1, the employee had 40 hours in their vacation bank, but because the employee_balance table is net of all future dated entries, I have to do some SQL processing to get the value for a current date.
SELECT eb.employee_key,
eb.bank_key,
'2023-01-01',
eb.employee_balance - ISNULL(SUM(feb.balance_delta), 0)
FROM employee a
INNER JOIN employee_balance eb on eb.employee_key = a.employee_key
LEFT OUTER JOIN wfms.employee_balance_trans ebt ON ebt.balance_key = eb.balance_key
AND ebt.employee_key = eb.employee_key
AND ebt.trans_dt > '2023-01-01'
GROUP BY eb.employee_key, eb.balance_key, eb.employee_balance
Running this query using 2023-01-01 returns a bank value of 40 hours. Running the query on 2023-07-01 returns a value of 32 hours and so on. This query is fine for calculating a single employee balance. The problem starts when a manager of a department with 1000 employees wants to see a report showing the employee banks at the beginning and end of each month.
I created a new table as follows:
Employee_bank_history
(
employee_key INT, --FK to employee
bank_key INT, --FK to bank
bank_date DATE,
bank_balance NUMERIC (10,5) -- Contains the bank balance as of the bank date
)
The table has a unique clustered index consisting of employee_key, bank_key and bank_date. The table is also populated every evening with a date range from December 31 2021 to Current Date. The start date gets reset every year, so there will be a maximum of 730 days worth of data. This means that at the maximum date range of 730 days, there will be almost 2 billion rows. (75,000 employees X 32 banks X 730 days.)
Currently, I am loading 950 million rows, and the following INSERT statement takes 30-45 minutes.
DECLARE #StartDate DATE = DATEFROMPARTS(DATEPART (YY, GETDATE())-2, 12, 31)
DECLARE #EndDate DATE = GETDATE()
;
WITH cte_bank_dates AS
(
SELECT [date]
FROM dim_date
WHERE [date] BETWEEN #StartDate AND #EndDate
)
INSERT INTO employee_balance_history
SELECT de.employee_key,
deb.balance_key,
cte.[date],
deb.bank_balance - ISNULL(SUM(feb.bank_delta), 0)
FROM employee de
INNER JOIN employee_balance deb on deb.employee_key = de.employee_key
CROSS JOIN cte_bank_dates cte
LEFT OUTER JOIN employee_balance_trans feb ON feb.balance_key = deb.balance_key
AND feb.employee_key = deb.employee_key
AND feb.bank_date > cte.[date]
GROUP BY de.employee_key, deb.balance_key, cte.[date], deb.bank_balance
OPTION (MAXRECURSION 0)
I use the CTE to get only the dates in the correct range. I need to have each date in the range, so that I know which future dates to exclude from the aggregate option. The resulting query to get bank balances as of a given date is blazing fast.
Today, I had my hands slapped and was told that the CROSS JOIN to the CTE was not needed and to optimize the query because it was slowing everything else down when it runs.
Leaving aside the fact that it will run overnight once in production, I'm left to wonder if there's a better way to populate this table for every employee, every bank and every date. The number of rows is unavoidable, as is the calculation to strip out future dated transactions from the employee bank balance.
Does anyone have any idea how I might make this faster, and less resource intensive on the server?

Related

Populating a list of dates without a defined end date - SQL server

I have a list of accounts and their cost which changes every few days.
In this list I only have the start date every time the cost updates to a new one, but no column for the end date.
Meaning, I need to populate a list of dates when the end date for a specific account and cost, should be deduced as the start date of the same account with a new cost.
More or less like that:
Account start date cost
one 1/1/2016 100$
two 1/1/2016 150$
one 4/1/2016 200$
two 3/1/2016 200$
And the result I need would be:
Account date cost
one 1/1/2016 100$
one 2/1/2016 100$
one 3/1/2016 100$
one 4/1/2016 200$
two 1/1/2016 150$
two 2/1/2016 150$
two 3/1/2016 200$
For example, if the cost changed in the middle of the month, than the sample data will only hold two records (one per each unique combination of account-start date-cost), while the results will hold 30 records with the cost for each and every day of the month (15 for the first cost and 15 for the second one). The costs are a given, and no need to calculate them (inserted manually).
Note the result contains more records because the sample data shows only a start date and an updated cost for that account, as of that date. While the results show the cost for every day of the month.
Any ideas?
Solution is a bit long.
I added an extra date for test purposes:
DECLARE #t table(account varchar(10), startdate date, cost int)
INSERT #t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500) -- extra row
;WITH CTE as
( SELECT
row_number() over (partition by account order by startdate) rn,
*
FROM #t
),N(N)AS
(
SELECT 1 FROM(VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1))M(N)
),
tally(N) AS -- tally is limited to 1000 days
(
SELECT ROW_NUMBER()OVER(ORDER BY N.N) - 1 FROM N,N a,N b
),GROUPED as
(
SELECT
cte.account, cte.startdate, cte.cost, cte2.cost cost2, cte2.startdate enddate
FROM CTE
JOIN CTE CTE2
ON CTE.account = CTE2.account
and CTE.rn = CTE2.rn - 1
)
-- used DISTINCT to avoid overlapping dates
SELECT DISTINCT
CASE WHEN datediff(d, startdate,enddate) = N THEN cost2 ELSE cost END cost,
dateadd(d, N, startdate) startdate,
account
FROM grouped
JOIN tally
ON datediff(d, startdate,enddate) >= N
Result:
cost startdate account
100 2016-01-01 one
100 2016-01-02 one
100 2016-01-03 one
150 2016-01-01 two
150 2016-01-02 two
200 2016-01-03 two
200 2016-01-04 one
200 2016-01-04 two
200 2016-01-05 two
500 2016-01-06 two
Thank you #t-clausen.dk!
It didn't solve the problem completely, but did direct me in the correct way.
Eventually I used the LEAD function to generate an end date for every cost per account, and then I was able to populate a list of dates based on that idea.
Here's how I generate the end dates:
DECLARE #t table(account varchar(10), startdate date, cost int)
INSERT #t
values
('one','1/1/2016',100),('two','1/1/2016',150),
('one','1/4/2016',200),('two','1/3/2016',200),
('two','1/6/2016',500)
select account
,[startdate]
,DATEADD(DAY, -1, LEAD([Startdate], 1,'2100-01-01') OVER (PARTITION BY account ORDER BY [Startdate] ASC)) AS enddate
,cost
from #t
It returned the expected result:
account startdate enddate cost
one 2016-01-01 2016-01-03 100
one 2016-01-04 2099-12-31 200
two 2016-01-01 2016-01-02 150
two 2016-01-03 2016-01-05 200
two 2016-01-06 2099-12-31 500
Please note that I set the end date of current costs to be some date in the far future which means (for me) that they are currently active.

Grouping Multiple Levels in a Query

So this is what I have.
Have a function that returns loan exceptions. So if someone has a missing document, that's an exception, or a signature required, that's an exception etc.
The problem is that ALL of the information being returned by this function contains all of the information for the loan. Including the amount.
So if there are 6 exceptions on a single loan, and the loan is 1000, then totaling the amount by exceptions gives you 6000, because 1000 is stored in every record detail.
So here is a similar set of records that I am returning.
poolDesc| loanNumber| Exception | Amount
Consumer| 123 | Missing Sig| 100
Consumer| 123 | Missing Doc| 100
Consumer| 123 | Late Pymt | 100
Estate | 456 | Address Ent| 2000
Estate | 456 | Missing Doc| 2000
Estate | 789 | Missing Sig| 1000
Consumer| 345 | Missing Sig| 500
What I am looking for out of that selection is:
POOL CountExceptions LoanAmount
Consumer 4 600
Estate 3 3000
There has to be a way to do this, and its going to an SSRS report if that helps.
Thanks
SELECT poolDesc Pool,
SUM(CountExceptions) CountExceptions,
SUM(LoanAmount) LoanAmount
FROM (
SELECT
poolDesc ,
COUNT(*) CountExceptions,
SUM(Amount) OVER (PARTITION BY poolDesc, loanNumber ) LoanAmount
FROM
loanExceptions
GROUP BY poolDesc, loanNumber, Amount
) a
GROUP BY poolDesc
full test script
create table loanExceptions (
poolDesc varchar(50),
loanNumber int,
Exception varchar(50),
Amount float)
insert into loanExceptions values
('Consumer',123,'Missing Sig',100),
('Consumer',123,'Missing Doc',100),
('Consumer',123,'Late Pymt',100),
('Estate',456,'Address Ent',2000),
('Estate',456,'Missing Doc',2000),
('Estate',789,'Missing Sig',1000),
('Consumer',345,'Missing Sig',500)
SELECT poolDesc Pool,
SUM(CountExceptions) CountExceptions,
SUM(LoanAmount) LoanAmount
FROM (
SELECT
poolDesc ,
COUNT(*) CountExceptions,
SUM(Amount) OVER (PARTITION BY poolDesc, loanNumber ) LoanAmount
FROM
loanExceptions
GROUP BY poolDesc, loanNumber, Amount
) a
GROUP BY poolDesc
DROP TABLE loanExceptions

Best method to merge two sql tables toegether

So I have two tables. One tracking a a persons location, and one that has the shifts of staff members.
Staff members have a staffId, location, start and end times, and cost of that shift.
People have an eventId, stayId, personId, location, start and end time. A person will have an event with multiple stays.
What I am attempting to do is mesh these two tables together, so that I can accurately report the cost of each location stay, based on the duration of that stay multiplied by the associated cost of staff covering that location at that time.
The issues I have are:
Location stays do not align with staff shifts. i.e. a person might be in location a between 1pm and 2pm, and four staff might be on shifts from 12:30 to 1:30, and two on from 1:30 till 5.
There are a lot of records.
Not all staff are paid the same
My current method is to expand both tables to have a record for every single minute. So a stay that is between 1pm and 2pm will have 60 records, and a staff shift that goes for 5 hours will have 300 records. I can then take all staff that are working on that location at that minute to get a minute value based on the cost of each staff member divided by the duration of their shift, and apply that value to the corresponding record in the other table.
Techniques used:
I create a table with 50,000 numbers, since some stays can be quite
long.
I take the staff table and join onto the numbers table to split each
shift. Then group it together based on location and minute, with a
staff count and minute cost.
The final step, and the one causing issues, is where I take the
location table, join onto numbers, and also onto the modified staff
table to produce a cost for that minute. I also count the number of
people in that location to account for staff covering multiple
people.
I'm finding this process extremely slow as you can imagine, since my person table has about 500 million records when expanded to the minute level, and the staff table has about 35 million when the same thing is done.
Can people suggest a better method for me to use?
Sample data:
Locations
| EventId | ID | Person | Loc | Start | End
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00
Staff
| Loc | StaffID | Cost | Start | End
| 1 | 99 | 40 | May, 20 2015 04:00:00 | May, 20 2015 12:00:00
| 1 | 15 | 85 | May, 20 2015 03:00:00 | May, 20 2015 5:00:00
| 3 | 85 | 74 | May, 20 2015 18:00:00 | May, 20 2015 20:00:00
| 4 | 10 | 36 | May, 20 2015 06:00:00 | May, 20 2015 14:00:00
Result
| EventId | ID | Person | Loc | Start | End | Cost
| 1 | 987 | 123 | 1 | May, 20 2015 07:00:00 | May, 20 2015 08:00:00 | 45.50
| 1 | 374 | 123 | 4 | May, 20 2015 08:00:00 | May, 20 2015 10:00:00 | 81.20
| 1 | 184 | 123 | 3 | May, 20 2015 10:00:00 | May, 20 2015 11:00:00 | 95.00
| 1 | 798 | 123 | 8 | May, 20 2015 11:00:00 | May, 20 2015 12:00:00 | 14.75
SQL:
Numbers table
;WITH x AS
(
SELECT TOP (224) object_id FROM sys.all_objects
)
SELECT TOP (50000) n = ROW_NUMBER() OVER (ORDER BY x.object_id)
INTO #numbers
FROM x CROSS JOIN x AS y
ORDER BY n
Staff Table
SELECT
Location,
ISNULL(SUM(ROUND(Cost/ CASE WHEN (DateDiff(MINUTE, StartDateTime, EndDateTime)) = 0 THEN 1 ELSE (DateDiff(MINUTE, StartDateTime, EndDateTime)) END, 5)),0) AS MinuteCost,
Count(Name) AS StaffCount,
RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, StartDateTime) + n.n -1, 0)
INTO #temp_StaffRoster
FROM dbo.StaffRoster
Grouping together, and where help is needed I think
INSERT INTO dbo.FinalTable
SELECT [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
,SUM(ISNULL(MinuteCost,1)/ISNULL(PeopleCount, 1)) AS Cost
,AVG(ISNULL(StaffCount,1)) AS AvgStaff
FROM dbo.Events event WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
LEFT OUTER JOIN #temp_StaffRoster staff WITH (NOLOCK) ON staff.Location= event.Location AND staff.RosterMinute = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
LEFT OUTER JOIN (SELECT [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0) AS Mins, COUNT(Id) as PeopleCount
FROM dbo.Events WITH (NOLOCK)
INNER JOIN #numbers n ON n.n BETWEEN 0 AND DATEDIFF(MINUTE, Start, End)
GROUP BY [Location], DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
) cap ON cap.Location= event.LocationAND cap.Mins = DATEADD(MI, DATEDIFF(MI, 0, Start) + n.n -1 , 0)
GROUP BY [EventId]
,[Id]
,[Start]
,[End]
,event.[Location]
UPDATE
So I have two tables. One tracking a a persons location, and one that has the shifts of staff members with their cost. I am attempting to consolidate the two tables to calculate the cost of each location stay.
Here is my method:
;;WITH stay AS
(
SELECT TOP 650000
StayId,
Location,
Start,
End
FROM stg_Stay
WHERE Loction IS NOT NULL -- Some locations don't currently have a matching shift location
ORDER BY Location, ADTM
),
shift AS
(
SELECT TOP 36000000
Location,
ShiftMinute,
MinuteCost,
StaffCount
FROM stg_Shifts
ORDER BY Location, ShiftMinute
)
SELECT
[StayId],
SUM(MinuteCost) AS Cost,
AVG(StaffCount) AS StaffCount
INTO newTable
FROM stay S
CROSS APPLY (SELECT MinuteCost, StaffCount
FROM shift R
WHERE R.Location = S.Location
AND R.ShiftMinute BETWEEN S.Start AND S.End
) AS Shifts
GROUP BY [StayId]
This is where I'm at.
I've split the Shifts table into a minute by minute level since there is no clear alignment of shifts to stays.
stg_Stay contains more columns than needed for this operation. stg_Shift is as shown.
Indexes used on stg_Shifts:
CREATE NONCLUSTERED INDEX IX_Shifts_Loc_Min
ON dbo.stg_Shifts (Location, ShiftMinute)
INCLUDE (MinuteCost, StaffCount);
on stg_Stay
CREATE INDEX IX_Stay_StayId ON dbo.stg_Stay (StayId);
CREATE CLUSTERED INDEX IX_Stay_Start_End_Loc ON dbo.stg_Stay (Location,Start,End);
Due to the fact that Shifts has ~36 million records and Stays has ~650k, what can I do to make this perform better?
Don't break down the rows by minutes.
Staging table may help if you can create fast relationship between them. i.e. the overlapped interval
SELECT *
FROM Locations l
OUTER APPLY -- Assume a staff won't appear in different location in the same period of time, of course.
(
SELECT
CONVERT(decimal(14,2), SUM(CostPerMinute * OverlappedMinutes)) AS ActualCost,
COUNT(DISTINCT StaffId) AS StaffCount,
SUM(OverlappedMinutes) AS StaffMinutes
FROM
(
SELECT
*,
-- Calculate overlapped time in minutes
DATEDIFF(MINUTE,
CASE WHEN StartTime > l.StartTime THEN StartTime ELSE l.StartTime END, -- Get greatest start time
CASE WHEN EndTime > l.EndTime THEN l.EndTime ELSE EndTime END -- Get least end time
) AS OverlappedMinutes,
Cost / DATEDIFF(MINUTE, StartTime, EndTime) AS CostPerMinute
FROM Staff
WHERE LocationId = l.LocationId
AND StartTime <= l.EndTime AND l.StartTime <= EndTime -- Match with overlapped time
) data
) StaffInLoc
SQL Fiddle
Take below with a grain of salt since your naming is horrible.
Location should really be a Stay as i guess location is another table defining an single physical location.
Your Staff table is also badly named. Why not name it Shift. I would expect a staff table to contain stuff like Name, Phone etc. Where a Shift table can contain multiple shifts for the same Staff etc.
Second i think your missing a relation between the two tables.
If you join Location and Staff only on Location and overlapping date times i don't think it would make a whole lot of sense for what your trying to do. How do you know which staff is at any location for a given time? Onlything you can do with location and overlapping dates is assume a entry is in the location table relates to every staff who have a shift at that location within the timeframe. So look at the below more as an inspiration to solving your problems and how to find overlapping datetime intervals and less like an actual solution to your problem since i think your data and model is in a bad shape.
If i got it all wrong please provide Primary Keys and Foreign Keys on your tables and a better explanation.
Some dummy data
DROP TABLE dbo.Location
CREATE TABLE dbo.Location
(
StayId INT,
EventId INT,
PersonId INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 987 ,1 ,123 ,1 ,'2015-05-20T07:00:00','2015-05-20T08:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 374 ,1 ,123 ,4 ,'2015-05-20T08:00:00','2015-05-20T10:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 184 ,1 ,123 ,3 ,'2015-05-20T10:00:00','2015-05-20T11:00:00')
INSERT INTO dbo.Location ( StayId ,EventId ,PersonId ,LocationId ,StartTime ,EndTime)
VALUES ( 798 ,1 ,123 ,8 ,'2015-05-20T11:00:00','2015-05-20T12:00:00')
DROP TABLE dbo.Staff
CREATE TABLE Staff
(
StaffId INT,
Cost INT,
LocationId INT,
StartTime DATETIME2(0),
EndTime DATETIME2(0)
)
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 99 ,40 ,1 ,'2015-05-20T04:00:00','2015-05-20T12:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 15 ,85 ,1 ,'2015-05-20T03:00:00','2015-05-20T05:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 85 ,74 ,3 ,'2015-05-20T18:00:00','2015-05-20T20:00:00')
INSERT INTO dbo.Staff ( StaffId ,Cost ,LocationId,StartTime ,EndTime)
VALUES ( 10 ,36 ,4 ,'2015-05-20T06:00:00','2015-05-20T14:00:00')
Actual query
WITH OnLocation AS
(
SELECT
L.StayId, L.EventId, L.LocationId, L.PersonId, S.Cost
, IIF(L.StartTime > S.StartTime, L.StartTime, S.StartTime) AS OnLocationStartTime
, IIF(L.EndTime < S.EndTime, L.EndTime, S.EndTime) AS OnLocationEndTime
FROM dbo.Location L
LEFT JOIN dbo.Staff S
ON S.LocationId = L.LocationId -- TODO are you not missing a join condition on staffid
-- Detects any overlaps between stays and shifts
AND L.StartTime <= S.EndTime AND L.EndTime >= S.StartTime
)
SELECT
*
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) AS DurationMinutes
, DATEDIFF(MINUTE, D.OnLocationStartTime, D.OnLocationEndTime) / 60.0 * Cost AS DurationCost
FROM OnLocation D
To get a summary you can take the query and add a GROUP BY for whatever your wan't to summarize.

SQL Server - cumulative sum on overlapping data - getting date that sum reaches a given value

In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities.
I have a database view that unions data from all of these tables giving us the ActivityView that looks like this.
As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.
+----------------------+---------------+---------------------+-------------------+
| activity_client_id | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
| 112 | Interview | 2015-06-01 09:00 | 2015-06-01 11:00 |
| 112 | CV updating | 2015-06-01 09:30 | 2015-06-01 11:30 |
| 112 | Course | 2015-06-02 09:00 | 2015-06-02 16:00 |
| 112 | Interview | 2015-06-03 09:00 | 2015-06-03 10:00 |
+----------------------+---------------+---------------------+-------------------+
Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:
+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
| 112 | 2015-05-20 |
+-----------+---------------------+
I need to create a report that will show the following columns:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible.
So this report will tell us how long from sign up does it take each client to achieve this figure.
What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:
In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30.
For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.
So the report for our sample client would give the following result:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
| 112 | 2015-05-20 | 2015-06-02 |
+-----------+---------------------+--------------------------------------------+
So my question is how can I create the report using a select statement ?
I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table.
But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.
I am using SQL Server 2005.
See SQL Fiddle here.
with tbl as (
-- this will generate daily merged ovelaping time
select distinct
a.id
,(
select min(x.starttime)
from act x
where x.id=a.id and ( x.starttime between a.starttime and a.endtime
or a.starttime between x.starttime and x.endtime )
) start1
,(
select max(x.endtime)
from act x
where x.id=a.id and ( x.endtime between a.starttime and a.endtime
or a.endtime between x.starttime and x.endtime )
) end1
from act a
), tbl2 as
(
-- this will add minute and total minute column
select
*
,datediff(mi,t.start1,t.end1) mi
,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
from tbl t
), tbl3 as
(
-- now final query showing starttime and endtime for 5 hours other wise null in case not completed 5(300 minutes) hours
select
t.id
,min(t.start1) starttime
,min(case when t.totalmi>300 then t.end1 else null end) endtime
from tbl2 t
group by t.id
)
-- final result
select *
from tbl3
where endtime is not null
This is one way to do it:
;WITH CTErn AS (
SELECT activity_client_id, activity_type,
activity_start_date, activity_end_date,
ROW_NUMBER() OVER (PARTITION BY activity_client_id
ORDER BY activity_start_date) AS rn
FROM activities
),
CTEdiff AS (
SELECT c1.activity_client_id, c1.activity_type,
x.activity_start_date, c1.activity_end_date,
DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id
ORDER BY x.activity_start_date) AS seq
FROM CTErn AS c1
LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
CROSS APPLY (SELECT CASE
WHEN c1.activity_start_date < c2.activity_end_date
THEN c2.activity_end_date
ELSE c1.activity_start_date
END) x(activity_start_date)
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date,
hoursOfActivicty
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id
CROSS APPLY (SELECT SUM(diff) / 60.0
FROM CTEdiff AS c3
WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq
Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.
Demo here
The first CTE, i.e. CTErn, produces the following output:
client_id activity_type start_date end_date rn
112 Interview 2015-06-01 09:00 2015-06-01 11:00 1
112 CV updating 2015-06-01 09:30 2015-06-01 11:30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 4
The second CTE, i.e. CTEdiff, uses the above table expression in order to calculate time difference for each record, taking into consideration any overlapps with the previous record:
client_id activity_type start_date end_date diff seq
112 Interview 2015-06-01 09:00 2015-06-01 11:00 120 1
112 CV updating 2015-06-01 11:00 2015-06-01 11:30 30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 420 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 60 4
The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.
The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity.
A Geometric Approach
For another issue, I've taken a geometric approach to date
packing. Namely, I convert dates and times to a sql geometry
type and utilize geometry::UnionAggregate to merge the ranges.
I don't believe this will work in sql-server 2005. But your
problem was such an interesting puzzle that I wanted to see
whether the geometrical approach would work. So any future
users running into this problem that have access to a later
version can consider it.
Code Description
In 'numbers':
I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in
your original table, so I just use it as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats
to create geometrical points.
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting
'lines' geometry object might contain multiple lines, but if they
overlap, they turn into one line.
In 'redate':
I use the numbers CTE to extract the individual lines inside 'lines'.
I envelope the lines which here ensures that the lines are stored
only as its two endpoints.
I read the endpoint x values and convert them back to their time
representations (This is usually the end goal, but you need more).
I calculate the difference in minutes between activity start and
end dates (I do this first in seconds then divide by 60 for the
sake of a precision issue).
I calculate the cumulative sume of these minutes for each row.
In the outer query:
I align the previous cumulative minutes sum with each current row
I filter for the row where the 5hr goal was met but where the
previous minutes shows that the 5hr goal for the previous row
was not met.
I then calculate where in the current row's range the user has
met the 5 hours, to not only arrive at the date the five hour
goal was met, but the exact time.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #activities -- where I put your data
),
mergeLines as (
select activity_client_id,
lines = geometry::UnionAggregate(line)
from #activities
cross apply (select
startP = geometry::Point(convert(float,activity_start_date), 0, 0),
stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
) pointify
cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
group by activity_client_id
),
redate as (
select client_id = activity_client_id,
activities_start_date,
activities_end_date,
minutes,
rollingMinutes = sum(minutes) over(
partition by activity_client_id
order by activities_start_date
rows between unbounded preceding and current row
)
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
activities_start_date = convert(datetime, l.line.STPointN(1).STX),
activities_end_date = convert(datetime, l.line.STPointN(3).STX)
) unprepare
cross apply (select minutes =
round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
) duration
)
select client_id,
activities_start_date,
activities_end_date,
met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date)
from (
select *,
prevRoll = lag(rollingMinutes) over (
partition by client_id
order by rollingMinutes
)
from redate
) ranker
where rollingMinutes >= 60 * 5
and prevRoll < 60 * 5;

Netezza: Show dates even if 0 data for that day

I have this query through an odbc connection in excel for a refreshable report with data for every 4 weeks. I need to show the dates in each of the 4 weeks even if there is no data for that day because this data is then linked to a Graph. Is there a way to do this?
thanks.
Select b.INV_DT, sum( a.ORD_QTY) as Ordered, sum( a.SHIPPED_QTY) as Shipped
from fct_dly_invoice_detail a, fct_dly_invoice_header b, dim_invoice_customer c
where a.INV_HDR_SK = b.INV_HDR_SK
and b.DIM_INV_CUST_SK = c.DIM_INV_CUST_SK
and a.SRC_SYS_CD = 'ABC'
and a.NDC_NBR is not null
**and b.inv_dt between CURRENT_DATE - 16 and CURRENT_DATE**
and b.store_nbr in (2851, 2963, 3249, 3385, 3447, 3591, 3727, 4065, 4102, 4289, 4376, 4793, 5209, 5266, 5312, 5453, 5569, 5575, 5892, 6534, 6571, 7110, 9057, 9262, 9652, 9742, 10373, 12392, 12739, 13870
)
group by 1
The general purpose solution to this is to create a date dimension table, and then perform an outer join to that date dimension table on the INV_DT column.
There are tons of good resources you can search for on creating a good date dimension table, so I'll just create a quick and dirty (and trivial) example here. I highly recommend some research in that area if you'll be doing a lot of BI/reporting.
If our table we want to report from looks like this:
Table "TABLEZ"
Attribute | Type | Modifier | Default Value
-----------+--------+----------+---------------
AMOUNT | BIGINT | |
INV_DT | DATE | |
Distributed on random: (round-robin)
select * from tablez order by inv_dt
AMOUNT | INV_DT
--------+------------
1 | 2015-04-04
1 | 2015-04-04
1 | 2015-04-06
1 | 2015-04-06
(4 rows)
and our report looks like this:
SELECT inv_dt,
SUM(amount)
FROM tablez
WHERE inv_dt BETWEEN CURRENT_DATE - 5 AND CURRENT_DATE
GROUP BY inv_dt;
INV_DT | SUM
------------+-----
2015-04-04 | 2
2015-04-06 | 2
(2 rows)
We can create a date dimension table that contains a row for every date (or ate last 1024 days in the past and 1024 days in the future using the _v_vector_idx view in this example).
create table date_dim (date_dt date);
insert into date_dim select current_date - idx from _v_vector_idx;
insert into date_dim select current_date + idx +1 from _v_vector_idx;
Then our query would look like this:
SELECT d.date_dt,
SUM(amount)
FROM tablez a
RIGHT OUTER JOIN date_dim d
ON a.inv_dt = d.date_dt
WHERE d.date_dt BETWEEN CURRENT_DATE -5 AND CURRENT_DATE
GROUP BY d.date_dt;
DATE_DT | SUM
------------+-----
2015-04-01 |
2015-04-02 |
2015-04-03 |
2015-04-04 | 2
2015-04-05 |
2015-04-06 | 2
(6 rows)
If you actually needed a zero value instead of a NULL for the days where you had no data, you could use a COALESCE or NVL like this:
SELECT d.date_dt,
COALESCE(SUM(amount),0)
FROM tablez a
RIGHT OUTER JOIN date_dim d
ON a.inv_dt = d.date_dt
WHERE d.date_dt BETWEEN CURRENT_DATE -5 AND CURRENT_DATE
GROUP BY d.date_dt;
DATE_DT | COALESCE
------------+----------
2015-04-01 | 0
2015-04-02 | 0
2015-04-03 | 0
2015-04-04 | 2
2015-04-05 | 0
2015-04-06 | 2
(6 rows)
I agree with #ScottMcG that you need to get the list of dates. However if you are in a situation where you aren't allowed to create a table. You can simplify things. All you need is a table that has at least 28 rows. Using your example, this should work.
select date_list.dt_nm, nvl(results.Ordered,0) as Ordered, nvl(results.Shipped,0) as Shipped
from
(select row_number() over(order by sub.arb_nbr)+ (current_date -28) as dt_nm
from (select rowid as arb_nbr
from fct_dly_invoice_detail b
limit 28) sub ) date_list left outer join
( Select b.INV_DT, sum( a.ORD_QTY) as Ordered, sum( a.SHIPPED_QTY) as Shipped
from fct_dly_invoice_detail a inner join
fct_dly_invoice_header b
on a.INV_HDR_SK = b.INV_HDR_SK
and a.SRC_SYS_CD = 'ABC'
and a.NDC_NBR is not null
**and b.inv_dt between CURRENT_DATE - 16 and CURRENT_DATE**
and b.store_nbr in (2851, 2963, 3249, 3385, 3447, 3591, 3727, 4065, 4102, 4289, 4376, 4793, 5209, 5266, 5312, 5453, 5569, 5575, 5892, 6534, 6571, 7110, 9057, 9262, 9652, 9742, 10373, 12392, 12739, 13870)
inner join
dim_invoice_customer c
on b.DIM_INV_CUST_SK = c.DIM_INV_CUST_SK
group by 1 ) results
on date_list.dt_nm = results.inv_dt

Resources