Need to generate rows with missing data in a large dataset - SQL - sql-server

We are comparing values between months over multiple years. As time moves on the number of years and months in the dataset increases. We are only interested in months where there were values for every year, i.e. a full set.
Consider the following example for 1 month (1) over 3 years (1,2,3) and two activities (101, 102)
Dataset:
Activity  Month  Year  Count
--------  -----  ----  -----
101       1      1     2
101       1      2     3
101       1      3     1
102       1      1     1
102       1      2     1
In the example above, only activity 101 comes into consideration, as it satisfies the condition that there must be a count for the activity for month 1 in years 1, 2 and 3.
Activity 102 doesn't qualify for further analysis as it has no record for year 3.
I would like to generate the missing row (in this case 102, 1, 3, 0) so that the dataset is complete and can then be evaluated:
Activity  Month  Year  Count
--------  -----  ----  -----
102       1      3     0
We find the problem difficult because the data keeps growing, the number of activities keeps expanding, and it is the combination of activity, year and month that needs to be evaluated.
An elegant solution will be appreciated.

As I mention in my comment, presumably you have both an Activity table and some kind of Calendar table with details of your activities and the years in your system. As such you can therefore do a CROSS JOIN between these 2 objects and then LEFT JOIN to your table to get the data set you want:
--Create sample objects/data
CREATE TABLE dbo.Activity (Activity int); --Obviously your table has more columns
INSERT INTO dbo.Activity (Activity)
VALUES (101),(102);
GO
CREATE TABLE dbo.Calendar (Year int,
Month int);--Likely your table has more columns
INSERT INTO dbo.Calendar (Year, Month)
VALUES(1,1),
(2,1),
(3,1);
GO
CREATE TABLE dbo.YourTable (Activity int,
Year int,
Month int,
[Count] int);
INSERT INTO dbo.YourTable (Activity,Month, Year, [Count])
VALUES(101,1,1,2),
(101,1,2,3),
(101,1,3,1),
(102,1,1,1),
(102,1,2,1);
GO
--Solution
SELECT A.Activity,
C.Month,
C.Year,
ISNULL(YT.[Count],0) AS [Count]
FROM dbo.Activity A
CROSS JOIN dbo.Calendar C
LEFT JOIN dbo.YourTable YT ON A.Activity = YT.Activity
AND C.[Year] = YT.[Year]
AND C.[Month] = YT.[Month]
WHERE C.Month = 1; --not sure if this is needed
If you don't have an Activity and Calendar table (I suggest, however, you should), then you can use subqueries with a DISTINCT, but note this will be far from performant with large data sets:
SELECT A.Activity,
C.Month,
C.Year,
ISNULL(YT.[Count],0) AS [Count]
FROM (SELECT DISTINCT Activity FROM dbo.YourTable) A
CROSS JOIN (SELECT DISTINCT Year, Month FROM dbo.YourTable) C
LEFT JOIN dbo.YourTable YT ON A.Activity = YT.Activity
AND C.[Year] = YT.[Year]
AND C.[Month] = YT.[Month]
WHERE C.Month = 1; --not sure if this is needed
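The question also asks which activity/month combinations form a full set. A minimal sketch of one way to check that directly, assuming the same column names as above: it compares each combination's number of distinct years with the total number of distinct years in the data. The table variable @YourTable is a hypothetical stand-in for the real table so that the batch runs on its own, and "every year" is assumed to mean every year that appears anywhere in the data; with a Calendar table you would count its years instead.

```sql
-- Hypothetical stand-in for the real table, loaded with the sample rows from the question
DECLARE @YourTable TABLE (Activity int, [Month] int, [Year] int, [Count] int);
INSERT INTO @YourTable (Activity, [Month], [Year], [Count])
VALUES (101,1,1,2),
       (101,1,2,3),
       (101,1,3,1),
       (102,1,1,1),
       (102,1,2,1);

-- Keep only the activity/month pairs that have a row for every year present in the data
SELECT YT.Activity,
       YT.[Month]
FROM @YourTable AS YT
GROUP BY YT.Activity, YT.[Month]
HAVING COUNT(DISTINCT YT.[Year]) = (SELECT COUNT(DISTINCT [Year]) FROM @YourTable);
```

With the sample rows this keeps activity 101 for month 1 and drops activity 102, which has no row for year 3.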

Related

Create a select statement that returns a record for each day after a given created date

I have a Dimension table containing machines.
Each machine has a date created value.
I would like a SELECT statement that returns, for each day after a certain start date, the number of available machines. A machine is available from its created date onwards.
As I have read-only access to the database, I am not able to create a physical calendar table.
I hope somebody can help me solve this issue.
I assume this is what you want. Based on this sample table:
USE tempdb;
GO
CREATE TABLE dbo.Machines
(
MachineID int,
CreatedDate date
);
INSERT dbo.Machines VALUES(1,'20200104'),(2,'20200202'),(3,'20200214');
Then say you wanted the number of active machines starting on January 1st:
DECLARE @StartDate date = '20200101';
;WITH x AS
(
SELECT n = 0 UNION ALL SELECT n + 1 FROM x
WHERE n < DATEDIFF(DAY, @StartDate, GETDATE())
),
days(d) AS
(
SELECT DATEADD(DAY, x.n, @StartDate) FROM x
)
SELECT days.d, MachineCount = COUNT(m.MachineID)
FROM days
LEFT OUTER JOIN dbo.Machines AS m
ON days.d >= m.CreatedDate
GROUP BY days.d
ORDER BY days.d
OPTION (MAXRECURSION 0);
Results:
d MachineCount
---------- ------------
2020-01-01 0
2020-01-02 0
2020-01-03 0
2020-01-04 1
2020-01-05 1
...
2020-01-31 1
2020-02-01 1
2020-02-02 2
2020-02-03 2
...
2020-02-12 2
2020-02-13 2
2020-02-14 3
2020-02-15 3
Clean up:
DROP TABLE dbo.Machines;
(Yes, some people hiss at recursive CTEs. You can replace it with any number of other set generation techniques.)
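As one example of a non-recursive set generation technique, here is a minimal sketch that numbers rows from a cross join of the sys.all_objects catalog view, which is normally readable even with read-only access. This is not necessarily one of the techniques the answer had in mind. The @EndDate variable and the @Machines table variable are hypothetical stand-ins (for GETDATE() and dbo.Machines respectively) so that the batch is repeatable and runs on its own.

```sql
DECLARE @StartDate date = '20200101';
DECLARE @EndDate   date = '20200215'; -- hypothetical fixed end date; use GETDATE() for "today"

-- Hypothetical stand-in for dbo.Machines
DECLARE @Machines TABLE (MachineID int, CreatedDate date);
INSERT INTO @Machines (MachineID, CreatedDate)
VALUES (1,'20200104'),(2,'20200202'),(3,'20200214');

WITH nums AS
(
    -- One row per day in the range, numbered from 0; no recursion needed
    SELECT TOP (DATEDIFF(DAY, @StartDate, @EndDate) + 1)
           n = CONVERT(int, ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) - 1
    FROM sys.all_objects AS a
    CROSS JOIN sys.all_objects AS b
),
days(d) AS
(
    SELECT DATEADD(DAY, nums.n, @StartDate) FROM nums
)
SELECT days.d, MachineCount = COUNT(m.MachineID)
FROM days
LEFT OUTER JOIN @Machines AS m
    ON days.d >= m.CreatedDate
GROUP BY days.d
ORDER BY days.d;
```

Because nothing is recursive here, the MAXRECURSION hint is not needed. The cross join only serves as a source of enough rows, and TOP limits it to the number of days required.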

Count Distinct Persons Per Year, but Only Once

How would I count distinct persons per year, counting each person only once (in the first year they appear)?
An example of my data is:
ID Date
1 20NOV2018
2 06JUN2017
2 29JUL2011
3 05MAY2014
4 04APR2002
4 25APR2009
I want my output to look like:
2002 1
2009 0
2011 1
2014 1
2017 0
2018 1
Use sub-selects to left join the distinct years with the first year an id occurs and count the ids from that.
data have;
input
id date date9.; format date date9.; datalines;
1 20NOV2018
2 06JUN2017
2 29JUL2011
3 05MAY2014
4 04APR2002
4 25APR2009
run;
proc sql;
create table want as
select each.year, count(first.id) as appeared_count
from
( select distinct year(date) as year
from have
) as each
left join
( select id, min(year(date)) as year
from have group by id
) as first
on each.year = first.year
group by each.year
order by each.year
;
quit;
Hope this code works for your case:
SELECT M.[YEAR], COUNT(S.ID) AS APPEARED_COUNT
FROM (SELECT DISTINCT YEAR(DATE) AS [YEAR] FROM #TAB) M -- one row per year, so ids are not double-counted
LEFT JOIN (SELECT MIN(YEAR(DATE)) AS [YEAR], ID
FROM #TAB GROUP BY ID) S ON M.[YEAR] = S.[YEAR]
GROUP BY M.[YEAR]
ORDER BY M.[YEAR]

Calculating Year to Date Total

I want to generate a Payroll type query whereby the values in Payroll 1 (say for the previous month) should be included in Payroll 2 (for the current month) Year-to-Date Totals.
This can best be explained with an example:
DECLARE @MyTable TABLE(ID INT IDENTITY, PayrollID INT, Description NVARCHAR(MAX), [Current Month] MONEY)
INSERT INTO @MyTable
VALUES (1,'Basic Salary',100),
(1,'Normal Over Time',50),
(1,'Work on Saturday',150),
(1,'Work on Sunday',200),
(2,'Basic Salary',100)
SELECT * ,SUM([Current Month]) OVER (PARTITION BY Description ORDER BY PayrollID) AS [Month to Date]
FROM @MyTable
When I run the above I get
ID EmployeeID PayrollID Description Current Month Month to Date
1 1 1 Basic Salary 100 100
2 1 1 Normal Over Time 50 50
3 1 1 Work on Saturday 150 150
4 1 1 Work on Sunday 200 200
5 1 2 Basic Salary 100 200
The Year-to-Date running totals are per each Description meaning Basic Salary Category has its own running total and so does Saturday and Sunday etc, etc. You will notice that for Basic Salary in Payroll 2 the running Year-to-Date total is 200 (i.e. 100 from Payroll 1 + 100 from Payroll 2)
The challenge I have is that Payroll 1 has data for Basic Salary, Work on Saturday and Work on Sunday whereas Payroll 2 only has Basic Salary as the employee did not work on Saturday nor on Sunday in Payroll 2 (the current month).
However, in the cumulative Year-to-Date column the data from Payroll 1 (previous month) should still be selected and included in the Year-to-Date running Total -
something like this:
ID EmployeeID PayrollID Description Current Month Month to Date
1 1 1 Basic Salary 100 100
2 1 1 Normal Over Time 50 50
3 1 1 Work on Saturday 150 150
4 1 1 Work on Sunday 200 200
5 1 2 Basic Salary 100 200
2 1 1 Normal Over Time NULL 50
3 1 1 Work on Saturday NULL 150
4 1 1 Work on Sunday NULL 200
Although the employee did not work on Saturday nor Sunday in the current month (Payroll 2) the running (Year-to-Date) totals for working on a Saturday should be 150 that he/she worked in the previous month (Payroll 1). The same should apply to working on Sunday where the running total in the current month (Payroll 2) should be the 200 that he/she worked in the previous month (Payroll 1).
How do I do that with a simple Select Statement without writing a complicated Procedure?
EDIT:
I have cleaned up the code as follows:
DECLARE @MyTable TABLE(ID INT IDENTITY, EmployeeID INT, PayrollID INT, Description NVARCHAR(MAX), [Current Month] MONEY)
INSERT INTO @MyTable
VALUES (1,1,'Basic Salary',100),
(1,1,'Normal Over Time',50),
(1,1,'Work on Saturday',150),
(1,1,'Work on Sunday',200),
(1,2,'Basic Salary',100)
WITH pay_elements AS
(
SELECT Description
FROM @MyTable
GROUP BY Description
)
,pay_slips AS
(
SELECT EmployeeID, PayrollID
FROM @MyTable
GROUP BY EmployeeID, PayrollID
)
,pay_lines AS
(
SELECT
mt.ID
,PS.EmployeeID
,PS.PayrollID
,PE.Description
,ISNULL(mt.[Current Month], 0) AS [Current Month]
FROM
pay_slips AS ps
OUTER APPLY
pay_elements AS pe
LEFT JOIN
@MyTable AS mt
ON (mt.EmployeeID = ps.EmployeeID)
AND (mt.PayrollID = ps.PayrollID)
AND (mt.Description = pe.Description)
)
SELECT * ,SUM([Current Month]) OVER (PARTITION BY EmployeeID, Description ORDER BY PayrollID) AS [Month to Date]
FROM pay_lines
And I get this error:
Msg 319, Level 15, State 1, Line 10
Incorrect syntax near the keyword 'with'. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.
Msg 102, Level 15, State 1, Line 17
Incorrect syntax near ','.
Msg 102, Level 15, State 1, Line 23
Incorrect syntax near ','.
You first need to build a "structure" of row headings, and then join that onto the actual data.
So for example:
-- The statement before WITH must be terminated with a semicolon
;WITH pay_elements AS
(
SELECT Description
FROM @MyTable
GROUP BY Description
)
,pay_slips AS
(
SELECT EmployeeID, PayrollID
FROM @MyTable
GROUP BY EmployeeID, PayrollID
)
,pay_lines AS
(
SELECT
mt.ID
,ps.EmployeeID
,ps.PayrollID
,pe.Description
,ISNULL(mt.[Current Month], 0) AS [Current Month]
FROM
pay_slips AS ps
CROSS JOIN -- every payslip gets a row for every pay element
pay_elements AS pe
LEFT JOIN
@MyTable AS mt
ON (mt.EmployeeID = ps.EmployeeID)
AND (mt.PayrollID = ps.PayrollID)
AND (mt.Description = pe.Description)
)
SELECT * ,SUM([Current Month]) OVER (PARTITION BY EmployeeID, Description ORDER BY PayrollID) AS [Month to Date]
FROM pay_lines
What we're doing here is getting a list of the different kind of pay elements in your table. Then we're getting a list of Employees and Payrolls done to date, and manually forcing every Payroll to include a row in respect of all possible pay elements.
Once that structure is built, we join onto the base table to get the actual values (replacing NULLs with zeros, for those pay elements that weren't originally included in the base table).
Then we simply query this padded-out table in the same way you did originally.
Note, I've written this on the fly and haven't checked this code so please excuse any minor errors.
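For reference, here is a minimal self-contained sketch of the same "build the structure, then join the data" idea, using the sample rows from the question's edit. It assumes the EmployeeID column from that edit, shortens Description to NVARCHAR(100) for the sketch, and terminates each statement with a semicolon, which is what the Msg 319 error text in the question asks for.

```sql
DECLARE @MyTable TABLE (ID int IDENTITY, EmployeeID int, PayrollID int,
                        Description nvarchar(100), [Current Month] money);

INSERT INTO @MyTable (EmployeeID, PayrollID, Description, [Current Month])
VALUES (1,1,'Basic Salary',100),
       (1,1,'Normal Over Time',50),
       (1,1,'Work on Saturday',150),
       (1,1,'Work on Sunday',200),
       (1,2,'Basic Salary',100);

WITH pay_elements AS
(
    -- Every kind of pay element seen so far
    SELECT DISTINCT Description FROM @MyTable
),
pay_slips AS
(
    -- Every employee/payroll combination seen so far
    SELECT DISTINCT EmployeeID, PayrollID FROM @MyTable
),
pay_lines AS
(
    -- Force each payslip to carry a row for each pay element,
    -- then pull in the real amount where one exists (0 otherwise)
    SELECT ps.EmployeeID,
           ps.PayrollID,
           pe.Description,
           ISNULL(mt.[Current Month], 0) AS [Current Month]
    FROM pay_slips AS ps
    CROSS JOIN pay_elements AS pe
    LEFT JOIN @MyTable AS mt
           ON mt.EmployeeID  = ps.EmployeeID
          AND mt.PayrollID   = ps.PayrollID
          AND mt.Description = pe.Description
)
SELECT *,
       SUM([Current Month]) OVER (PARTITION BY EmployeeID, Description
                                  ORDER BY PayrollID) AS [Month to Date]
FROM pay_lines
ORDER BY PayrollID, Description;
```

For Payroll 2 this returns all four descriptions, with Work on Saturday and Work on Sunday showing 0 for the current month and 150 and 200 as running totals.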
I am a little confused by the Year-to-Date column you mention in your description. I assume it is the [Month to Date] column in your query; please correct me if I am wrong.
I think what you are trying to achieve is that the descriptions not present in payroll ID 2, such as Work on Saturday and Work on Sunday, should also appear in the result set.
Problem is:
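Payroll ID 2 has no rows for those descriptions, so there is nothing for the running total to be displayed against. (SUM skips NULL values, so it is the missing rows, not NULLs as such, that stop 50, 150 and 200 from showing in the [Month to Date] column.)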
You can have fixed categories against each payroll id:
Normal Over Time
Work on Saturday
Work on Sunday
Basic Salary
Query:
DECLARE @MyTable TABLE(ID INT IDENTITY, PayrollID INT, Description NVARCHAR(MAX), [Current Month] MONEY)
INSERT INTO @MyTable
VALUES (1,'Basic Salary',100),
(1,'Normal Over Time',50),
(1,'Work on Saturday',150),
(1,'Work on Sunday',200),
(2,'Basic Salary',100),
(2,'Normal Over Time',0),
(2,'Work on Saturday',0),
(2,'Work on Sunday',0)
SELECT * ,SUM([Current Month]) OVER (PARTITION BY Description ORDER BY PayrollID) AS [Month to Date]
FROM @MyTable order by ID,PayrollID

SQL- Finding a gap that is x amount of months with the same foreign key

I am editing this to clarify my question.
Let's say I have a table that holds patient information. I need to find new patients for this year, and the date of the first prescription at which they were considered new. Any time there is a gap of six months, they are considered a new patient.
How do I accomplish this using SQL? I can do this easily enough in Java or any other imperative language, but I am having problems doing it in SQL. I need this script to be run in Crystal by non-SQL users.
Table:
Patient ID Prescription Date
-----------------------------------------
1 12/31/16
1 03/13/17
2 10/10/16
2 05/11/17
2 06/11/17
3 01/01/17
3 04/20/17
4 01/31/16
4 01/01/17
4 07/02/17
So patients 2 and 4 are considered new patients. Patient 4 is considered a new patient twice, so I need the date of each time patient 4 was considered new: 1/1/17 and 7/2/17. Patients 1 and 3 are not considered new this year.
So far I have the code below which tells me if they are new this year, but not if they had another six month gap this year.
SELECT DISTINCT
this_year.patient_id
,this_year.date
FROM (SELECT
patient_id
,MIN(prescription_date) as date
FROM table
WHERE prescription_date BETWEEN '2017-01-01 00:00:00.000' AND '2017-12-31 00:00:00.000'
GROUP BY [patient_id]) AS this_year
LEFT JOIN (SELECT
patient_id
,MAX(prescription_date) as date
FROM table
WHERE prescription_date BETWEEN '2016-01-01 00:00:00.000' AND '2016-12-31 00:00:00.000'
GROUP BY [patient_id]) AS last_year
ON last_year.patient_id = this_year.patient_id
WHERE DATEDIFF(month, last_year.date, this_year.date) > 6
OR last_year.date IS NULL
Patient 2 in your example does not meet the criteria you specified ... that being said ...
You can try something like this ... untested but should be similar (assuming you can put this in a stored procedure):
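WITH ordered AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY [PatientID] ORDER BY [Prescription Date]) rn -- number each patient's prescriptions separately
FROM table1
)
SELECT o1.[PatientID], DATEDIFF(s, o1.[Prescription Date], o2.[Prescription Date]) diff
FROM ordered o1 JOIN ordered o2
ON o1.[PatientID] = o2.[PatientID] -- only compare prescriptions of the same patient
AND o1.rn + 1 = o2.rn
WHERE DATEDIFF(m, o1.[Prescription Date], o2.[Prescription Date]) > 6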
Replace table1 with the name of your table.
I assume that you mean the patient has not been prescribed in the last 6 months.
SELECT DISTINCT user_id
FROM table_name
WHERE prescribed_date >= DATEADD(month, -6, GETDATE())
This gives you the list of users that have been prescribed in the last 6 months. You want the list of users that are not in this list.
SELECT DISTINCT user_id
FROM table_name
WHERE user_id NOT IN (SELECT DISTINCT user_id
FROM table_name
WHERE prescribed_date >= DATEADD(month, -6, GETDATE()))
You'll need to amend the field and table names.
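Another way to look at the six-month gap is to compare each prescription with the same patient's previous one. A minimal sketch using LAG (available from SQL Server 2012), under a few assumptions that are not in the original: the table variable @Prescriptions and its column names are hypothetical stand-ins, a patient's very first prescription is not flagged (following the example, where patient 3 is not considered new), and "six month gap" is read as "more than six calendar months after the previous prescription".

```sql
-- Hypothetical stand-in table, loaded with the sample rows from the question
DECLARE @Prescriptions TABLE (PatientID int, PrescriptionDate date);
INSERT INTO @Prescriptions (PatientID, PrescriptionDate)
VALUES (1,'20161231'),(1,'20170313'),
       (2,'20161010'),(2,'20170511'),(2,'20170611'),
       (3,'20170101'),(3,'20170420'),
       (4,'20160131'),(4,'20170101'),(4,'20170702');

WITH with_prev AS
(
    -- Attach each patient's previous prescription date to every row
    SELECT PatientID,
           PrescriptionDate,
           LAG(PrescriptionDate) OVER (PARTITION BY PatientID
                                       ORDER BY PrescriptionDate) AS PrevDate
    FROM @Prescriptions
)
SELECT PatientID,
       PrescriptionDate AS NewPatientDate
FROM with_prev
WHERE PrevDate IS NOT NULL                            -- a first-ever prescription is not flagged
  AND DATEADD(MONTH, 6, PrevDate) < PrescriptionDate  -- gap of more than six months
  AND PrescriptionDate >= '20170101'                  -- only this year (2017 in the example)
  AND PrescriptionDate <  '20180101'
ORDER BY PatientID, PrescriptionDate;
```

On the sample rows this flags patient 2 on 2017-05-11 and patient 4 on 2017-01-01 and 2017-07-02. DATEADD is used rather than DATEDIFF(month, ...) because DATEDIFF counts month boundaries crossed, so it would report 2017-01-01 to 2017-07-02 as exactly 6.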

Using t-sql to select aggregate when date difference is not just equal but small

I have a table where I want to select the maximum of a column, but only among rows whose dates are equal or close together (let's say within 3 days). When two subsequent dates are very close, the data are likely spurious, and I want to get the highest state when that happens.
My data looks similar to this
DECLARE @TestingResults TABLE (
IDNumber varchar(100),
DateSeen date,
[state] int)
INSERT INTO @TestingResults VALUES
('A','2015-04-21',2),
('A','2015-05-08',2),
('A','2015-07-01',3),
('B','2014-06-18',100), -- this is the one I want
('B','2014-06-19',2),
('B','2014-07-31',2),
('B','2014-08-11',3),
('B','2014-09-24',3),
('B','2014-10-24',3),
('B','2014-11-24',3),
('B','2014-12-15',3),
('B','2015-01-12',3),
('B','2015-01-13',400), -- this is the one I want
('B','2015-04-06',10), -- either will do
('B','2015-04-07',10),
('B','2015-07-06',3), -- either will do
('B','2015-07-07',3),
('B','2015-10-12',3),
('C','2012-02-20',3),
('C','2012-03-12',3),
('C','2012-04-02',3),
('C','2012-11-21',3)
What I really want is something like this where I take the maximum of state when the difference between dates is < 3 (note, some of the data may have the same state even when the differences in date are small ...) :
IDNumber DateSeen state
A 2015-04-21 2
A 2015-05-08 2
A 2015-07-01 3
-- if there are observations < 3 days apart, take MAX
B 2014-06-18 100
B 2014-07-31 2
B 2014-08-11 3
B 2014-09-24 3
B 2014-10-24 3
B 2014-11-24 3
B 2014-12-15 3
-- if there are observations < 3 days apart, take MAX
B 2015-01-13 400
-- if there are observations < 3 days apart, take MAX
B 2015-04-07 10
-- if there are observations < 3 days apart, take MAX
B 2015-07-07 3
B 2015-10-12 3
C 2012-02-20 3
C 2012-03-12 3
C 2012-04-02 3
C 2012-11-21 3
I guess I could create another table variable to hold it and then query it, but there are a couple of problems. First, as you can see, IDNumber = 'B' has a couple of triggers in its sequence of dates, so I am thinking there should be a 'smarter' way.
Thanks!
After your clarifying comments (thanks for that!), I would do this as follows:
SELECT ISNULL(high.IDNumber, results.IDNumber) AS IDNumber,
ISNULL(high.DateSeen, results.DateSeen) AS DateSeen,
ISNULL(high.[state], results.[state]) AS [state]
FROM @TestingResults results
OUTER APPLY
(
SELECT TOP 1 IDNumber, DateSeen, [state]
FROM @TestingResults highest
WHERE highest.DateSeen < results.DateSeen
AND highest.IDNumber = results.IDNumber
AND DATEDIFF(DAY,highest.DateSeen,results.DateSeen) <=3
ORDER BY [state] DESC, [DateSeen] DESC
) high
WHERE NOT EXISTS
(
SELECT 1
FROM @TestingResults nearFuture
WHERE nearFuture.DateSeen > results.DateSeen
AND nearFuture.IDNumber = results.IDNumber
AND DATEDIFF(DAY,results.DateSeen,nearFuture.DateSeen) <=3
)
This is almost certainly not the most elegant way to achieve this (I suspect it could be done more efficiently with window functions, a recursive CTE or similar), but I believe it gives you the behaviour and results you desire.
This should do it using a recursive CTE:
WITH TestingResults AS (
SELECT
*
,ROW_NUMBER() OVER(ORDER BY IDNumber, DateSeen) AS RowNum
FROM @TestingResults
), Data AS (
SELECT
tmp1.IDNumber,
tmp1.DateSeen,
tmp1.state,
tmp1.RowNum,
tmp1.RowNum AS GroupID
FROM (
SELECT
*
,ABS(DATEDIFF(DAY, DateSeen, LAG(DateSeen, 1, NULL) OVER(PARTITION BY IDNumber ORDER BY DateSeen))) AS AbsPrev
FROM TestingResults
) AS tmp1
WHERE tmp1.AbsPrev IS NULL OR tmp1.AbsPrev >= 3 --the first date in a sequence
UNION ALL
SELECT
r.IDNumber,
r.DateSeen,
r.state,
r.RowNum,
d.GroupID
FROM Data d
INNER JOIN TestingResults r ON
r.IDNumber = d.IDNumber
AND DATEDIFF(DAY, d.DateSeen, r.DateSeen) < 3
AND d.RowNum+1 = r.RowNum
)
SELECT MIN(d.IDNumber) AS IDNumber, MAX(d.DateSeen) AS DateSeen, MAX(d.state) AS state
FROM Data d
GROUP BY d.GroupID
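The same grouping can also be sketched without recursion, as a "gaps and islands" query: flag each row that starts a new run, turn the flags into a group number with a running SUM, then aggregate per group. This is an alternative technique rather than either answer's method, it needs SQL Server 2012 or later for LAG and the windowed running SUM, and it uses only a subset of the sample rows to stay short. Like the recursive version, it reports MAX(DateSeen) for each run.

```sql
DECLARE @TestingResults TABLE (IDNumber varchar(100), DateSeen date, [state] int);
INSERT INTO @TestingResults (IDNumber, DateSeen, [state])
VALUES ('A','2015-04-21',2),
       ('A','2015-05-08',2),
       ('B','2014-06-18',100),
       ('B','2014-06-19',2),
       ('B','2014-07-31',2),
       ('B','2015-01-12',3),
       ('B','2015-01-13',400);

WITH flagged AS
(
    -- IsStart = 1 when the row is the first for its ID or is 3+ days after the previous row
    SELECT IDNumber, DateSeen, [state],
           CASE WHEN DATEDIFF(DAY,
                              LAG(DateSeen) OVER (PARTITION BY IDNumber ORDER BY DateSeen),
                              DateSeen) < 3
                THEN 0 ELSE 1 END AS IsStart
    FROM @TestingResults
),
grouped AS
(
    -- A running total of the start flags gives every run of close dates its own GroupID
    SELECT IDNumber, DateSeen, [state],
           SUM(IsStart) OVER (PARTITION BY IDNumber ORDER BY DateSeen
                              ROWS UNBOUNDED PRECEDING) AS GroupID
    FROM flagged
)
SELECT IDNumber,
       MAX(DateSeen) AS DateSeen,
       MAX([state])  AS [state]
FROM grouped
GROUP BY IDNumber, GroupID
ORDER BY IDNumber, MAX(DateSeen);
```

Note that each row is compared with the one immediately before it, so a long chain of rows that are each less than 3 days apart ends up in a single group, even if the first and last rows of the chain are further apart than that.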
