Removing duplicate rows based on date column and status - sql-server

I have the following table, where every row represents a change in the user's status and the occurrence date.
Date        ID  Status
02.01.2021  64  Register
02.04.2021  64  Active
02.07.2021  64  Not Active
02.10.2021  64  Active
02.25.2021  64  Active
02.30.2021  64  Not Active
03.03.2021  64  Active
01.01.2021  11  Register
01.06.2021  11  Active
01.07.2021  11  Active
01.10.2021  11  Elite
01.15.2021  11  Elite
It contains duplicate statuses on different dates, and I would like to keep only the latest row whenever the same status occurs consecutively.
I want my end table to look like this:
Date        ID  Status
02.01.2021  64  Register
02.04.2021  64  Active
02.07.2021  64  Not Active
02.25.2021  64  Active
02.30.2021  64  Not Active
03.03.2021  64  Active
01.01.2021  11  Register
01.07.2021  11  Active
01.15.2021  11  Elite
Would appreciate any help on this.

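You can use LEAD to check whether the next row has a different status, and keep the current row only if it does (or if there is no next row):
SELECT
    Date,
    ID,
    Status
FROM (
    SELECT *,
        -- LEAD looks at the following row, so this is the next status, not the previous one
        NextStatus = LEAD(Status) OVER (PARTITION BY ID ORDER BY Date)
    FROM YourTable t
) t
WHERE NextStatus <> Status OR NextStatus IS NULL;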
db<>fiddle
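As a way to check this approach against the sample data from the question, here is a minimal sketch that runs the same LEAD filter in SQLite through Python's sqlite3 module. The table name status_log, the column name status_date, and the ISO-style text dates are assumptions made for the example (the text dates are only used for ordering), and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Sample rows from the question; dates are stored as ISO-style text so that
# plain string ordering matches chronological ordering (an assumption here).
rows = [
    ("2021-02-01", 64, "Register"),
    ("2021-02-04", 64, "Active"),
    ("2021-02-07", 64, "Not Active"),
    ("2021-02-10", 64, "Active"),
    ("2021-02-25", 64, "Active"),
    ("2021-02-30", 64, "Not Active"),
    ("2021-03-03", 64, "Active"),
    ("2021-01-01", 11, "Register"),
    ("2021-01-06", 11, "Active"),
    ("2021-01-07", 11, "Active"),
    ("2021-01-10", 11, "Elite"),
    ("2021-01-15", 11, "Elite"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE status_log (status_date TEXT, id INTEGER, status TEXT)")
conn.executemany("INSERT INTO status_log VALUES (?, ?, ?)", rows)

# Keep a row only when the next row for the same id has a different status,
# or when there is no next row at all.
query = """
SELECT status_date, id, status
FROM (
    SELECT *,
           LEAD(status) OVER (PARTITION BY id ORDER BY status_date) AS next_status
    FROM status_log
) AS t
WHERE next_status <> status OR next_status IS NULL
ORDER BY id, status_date
"""
latest_per_run = conn.execute(query).fetchall()
for row in latest_per_run:
    print(row)
```

The nine rows printed correspond to the desired end table in the question: the earlier row of each repeated run (for example Active on 02.10.2021 for ID 64) is dropped.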

This seems like a "Gaps and Islands" problem as stated by a comment.
--SampleData
WITH mydata AS (
SELECT * FROM (VALUES
(1,'active',CAST(GETDATE()-1 AS Date))
,(1,'active',CAST(GETDATE()-2 AS Date))
,(1,'disabled',CAST(GETDATE()-3 AS Date))
,(1,'active',CAST(GETDATE()-4 AS Date))
,(1,'active',CAST(GETDATE()-5 AS Date))
,(2,'active',CAST(GETDATE()-1 AS Date))
,(2,'disabled',CAST(GETDATE()-2 AS Date))
,(2,'disabled',CAST(GETDATE()-3 AS Date))
,(2,'disabled',CAST(GETDATE()-4 AS Date))
,(2,'active',CAST(GETDATE()-5 AS Date))
) x(ID,Status,Date)
)
--Actual logic starts
,Islands AS (
SELECT *,
Island = ROW_NUMBER() OVER(PARTITION BY Id ORDER BY Id, Date) -
ROW_NUMBER() OVER (PARTITION BY Id, Status ORDER BY Id, Date)
FROM mydata
)
SELECT Id, Status, Island, MAX(Date) AS MaxDate
FROM Islands
GROUP BY Id, Status, Island
ORDER BY Id, MaxDate
To understand it, you need to look at what window functions are and how they behave in this specific context.
The idea is to use two scoped counters, one scoped by Id, the second by Id and Status.
The first one grows on every row and restarts only when the Id changes; the second one grows only on rows with its own status and stands still while other statuses occur.
Subtracting the two counters creates a "step": the difference stays constant while the status stays the same, and it changes when that status comes back after a different one.
This step, called Island in the query, can then be used together with Id and Status to obtain the result you asked for.
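The counter arithmetic can be traced by hand. The following is a small sketch in plain Python that mimics the two ROW_NUMBER() calls with dictionaries; the rows are made-up illustrative data, already sorted by id and date, and are not taken from the query above.

```python
from collections import defaultdict

# (id, status, date) rows, already sorted by id and date; illustrative data only.
rows = [
    (1, "active", "2024-01-01"),
    (1, "active", "2024-01-02"),
    (1, "disabled", "2024-01-03"),
    (1, "active", "2024-01-04"),
    (1, "active", "2024-01-05"),
]

per_id = defaultdict(int)         # ROW_NUMBER() OVER (PARTITION BY Id ...)
per_id_status = defaultdict(int)  # ROW_NUMBER() OVER (PARTITION BY Id, Status ...)
latest = {}                       # (id, status, island) -> MAX(date)

for id_, status, date in rows:
    per_id[id_] += 1
    per_id_status[(id_, status)] += 1
    # The difference is constant inside one run of equal statuses.
    island = per_id[id_] - per_id_status[(id_, status)]
    key = (id_, status, island)
    latest[key] = max(latest.get(key, date), date)

for key, max_date in sorted(latest.items(), key=lambda kv: kv[1]):
    print(key, max_date)
```

The first two active rows get island 0, the disabled row gets island 2, and the last two active rows get island 1, so grouping by (id, status, island) separates the two active runs.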

Related

Write Query That Consider Date Interval

I have a table that contains Transactions of Customers.
I need to find customers that had at least 2 transactions with amount >= 20000 within three consecutive days, looking back over the last month.
For example, if today is 2022/03/12, I should gather the transactions from 2022/02/13 to 2022/03/12, then check this data to see whether a customer had at least 2 transactions with amount >= 20000 within three consecutive days.
For Example, Consider Below Table:
Id  CustomerId  Transactiondate  Amount
1   1           2022-01-01       50000
2   2           2022-02-01       20000
3   3           2022-03-05       30000
4   3           2022-03-07       40000
5   2           2022-03-07       20000
6   4           2022-03-07       30000
7   4           2022-03-07       30000
The output should be CustomerId = 3 and CustomerId = 4.
I wrote a query that finds these customers for one specific day, but I don't know how to find them over a whole month without using a loop.
The query for a specific day is:
WITH cte AS (
    SELECT CustomerId, Amount, TransactionDate, DATEADD(day, -2, TransactionDate) AS PrevDate
    FROM [Transaction]  -- TRANSACTION is a reserved word, so the table name is bracketed
    WHERE TransactionDate = '2022-03-12'
)
SELECT CustomerId, COUNT(*)
FROM cte
WHERE
    TransactionDate >= PrevDate AND TransactionDate <= TransactionDate
    AND Amount >= 20000
GROUP BY CustomerId
HAVING COUNT(*) >= 2
Hi, there are many ways to achieve this.
I think the easiest (though maybe not the best for performance) is to use the LAG function:
WITH lagged_days AS (
    SELECT
        ISNULL(LAG(Transactiondate) OVER (PARTITION BY CustomerID ORDER BY id),
               LEAD(Transactiondate) OVER (PARTITION BY CustomerID ORDER BY id)) AS lagged_dt
        ,*
    FROM [Transaction]  -- TRANSACTION is a reserved word in T-SQL, hence the brackets
), valid_cust_base AS (
    SELECT
        *
    FROM lagged_days
    WHERE DATEPART(MONTH, lagged_dt) = DATEPART(MONTH, Transactiondate)
    -- ABS because lagged_dt can be earlier (LAG) or later (LEAD) than Transactiondate
    AND ABS(DATEDIFF(day, Transactiondate, lagged_dt)) <= 3
    AND Amount >= 20000
)
SELECT
    CustomerID
FROM valid_cust_base
GROUP BY CustomerID
HAVING COUNT(*) >= 2
First I created a lagged TransactionDate per customer (I assume that id is incremental). Then I selected only transactions within the same month, with amount >= 20000, and where the date difference between transactions is less than 4 days. Finally, I select the customers who had more than 1 such transaction.
With LAG, the first value per customer is always missing, but you still need to be able to say that the 1st and 2nd transactions are within 3 days of each other. That's why I replace the NULL with the LEAD value. It doesn't matter whether you use:
ISNULL(LAG(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id),
LEAD(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id)) lagged_dt
OR
ISNULL(LEAD(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id),
LAG(Transactiondate) OVER(PARTITION BY CustomerID ORDER BY id)) lagged_dt
The main goal is to have, for each transaction, the closest neighbouring TransactionDate.
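To make the neighbour comparison concrete, here is a small sketch in plain Python over the sample table from the question. It is not a translation of the query above: it filters by amount and date range first, then compares each qualifying transaction with the previous one for the same customer. The function name and the max_gap_days parameter are hypothetical, and the default of 2 assumes that "three consecutive days" means dates at most two days apart (the query above uses a looser bound of 3).

```python
from collections import defaultdict
from datetime import date

# (customer_id, transaction_date, amount) from the question's sample table.
transactions = [
    (1, date(2022, 1, 1), 50000),
    (2, date(2022, 2, 1), 20000),
    (3, date(2022, 3, 5), 30000),
    (3, date(2022, 3, 7), 40000),
    (2, date(2022, 3, 7), 20000),
    (4, date(2022, 3, 7), 30000),
    (4, date(2022, 3, 7), 30000),
]


def customers_with_close_pair(rows, start, end, min_amount=20000, max_gap_days=2):
    """Return customers with two qualifying transactions at most max_gap_days apart."""
    by_customer = defaultdict(list)
    for customer_id, tx_date, amount in rows:
        if start <= tx_date <= end and amount >= min_amount:
            by_customer[customer_id].append(tx_date)

    result = set()
    for customer_id, dates in by_customer.items():
        dates.sort()
        # Comparing each date with the one before it is what LAG does in SQL.
        for prev, cur in zip(dates, dates[1:]):
            if (cur - prev).days <= max_gap_days:
                result.add(customer_id)
                break
    return result


matches = customers_with_close_pair(transactions, date(2022, 2, 13), date(2022, 3, 12))
print(sorted(matches))
```

With the period 2022-02-13 to 2022-03-12 from the question, this yields customers 3 and 4, which is the output the question expects.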

Getting the Min(startdate) and Max(enddate) for an ID when that ID shows up multiple times

I have a table with columns for ID, StartDate, EndDate, and a flag for whether there was a gap between the end date of that row and the next start date. If there were only one set of rows per ID, I know that I could just do
SELECT min(startdate),max(enddate)
FROM table
GROUP BY ID
However, I have multiple instances of these IDs in several non-connected timespans. If I did that, I would get the very first start date and the very last end date across all of that ID's timespans. How would I go about making sure I get the min and max dates for each specific block of time?
I thought about creating a new column that holds a number for each block of time: the first block with no gaps would get 1, and when the next row follows a gap the number would increase by 1 to mark a new block. But I am not really sure how to go about that. Here is some sample data to illustrate what I am working with:
ID StartDate EndDate NextDate Gap_ind
001 1/1/2018 1/31/2018 2/1/2018 N
001 2/1/2018 2/30/2018 3/1/2018 N
001 3/1/2018 3/31/2018 5/1/2018 Y
001 5/1/2018 5/31/2018 6/1/2018 N
001 6/1/2018 6/30/2018 6/30/2018 N
This is a classic "gaps and islands" problem, where you are trying to define the boundaries of your islands, and which you can solve by using some windowing functions.
Your initial effort is on track. Rather than getting the next start date, though, I used the previous end date to calculate the groupings.
The innermost subquery below gets the previous end date for each of your date ranges, and also assigns a row number that we use later to keep our groupings in order.
The next subquery out uses the previous end date to identify which groups of date ranges go together (overlap, or nearly so).
The outermost query is the end result you're looking for.
SELECT
Grp.ID,
MIN(Grp.StartDate) AS GroupingStartDate,
MAX(Grp.EndDate) AS GroupingEndDate
FROM
(
SELECT
PrevDt.ID,
PrevDt.StartDate,
PrevDt.EndDate,
SUM(CASE WHEN DATEADD(DAY,1,PrevDt.PreviousEndDate) >= PrevDt.StartDate THEN 0 ELSE 1 END)
OVER (PARTITION BY PrevDt.ID ORDER BY PrevDt.RN) AS GrpNum
FROM
(
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StartDate, EndDate) as RN,
ID,
StartDate,
EndDate,
LAG(EndDate,1) OVER (PARTITION BY ID ORDER BY StartDate) AS PreviousEndDate
FROM
tbl
) AS PrevDt
) AS Grp
GROUP BY
Grp.ID,
Grp.GrpNum;
Results:
+-----+------------------+--------------+
| ID | InitialStartDate | FinalEndDate |
+-----+------------------+--------------+
| 001 | 2018-01-01 | 2018-03-01 |
| 001 | 2018-05-01 | 2018-06-01 |
+-----+------------------+--------------+
SQL Fiddle demo.
Further reading:
The SQL of Gaps and Islands in Sequences
Gaps and Islands Across Date Ranges
This is an example of a gaps-and-islands problem. A simple solution is to use lag() to determine if there are overlaps. When there is none, you have the start of a group. A cumulative sum defines the group -- and you aggregate on that.
select t.id, min(startdate), max(enddate)
from (select t.*,
sum(case when prev_enddate >= dateadd(day, -1, startdate)
then 0 else 1
end) over (partition by id order by startdate) as grp
from (select t.*, lag(enddate) over (partition by id order by startdate) as prev_enddate
from t
) t
) t
group by id, grp;
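The lag-plus-cumulative-sum idea can also be followed step by step outside SQL. Below is a minimal sketch in plain Python, assuming rows shaped like the question's sample data; 2/30/2018 from the question is not a real date, so 2/28/2018 is used instead, and the function name group_ranges is hypothetical.

```python
from datetime import date, timedelta

# (id, start_date, end_date), adapted from the question's sample rows.
ranges = [
    ("001", date(2018, 1, 1), date(2018, 1, 31)),
    ("001", date(2018, 2, 1), date(2018, 2, 28)),
    ("001", date(2018, 3, 1), date(2018, 3, 31)),
    ("001", date(2018, 5, 1), date(2018, 5, 31)),
    ("001", date(2018, 6, 1), date(2018, 6, 30)),
]


def group_ranges(rows):
    """Map (id, group number) to (min start, max end) for each block of time."""
    result = {}
    prev_id, prev_end, grp = None, None, 0
    for id_, start, end in sorted(rows):
        if id_ != prev_id:
            grp, prev_end = 0, None  # new partition: the "lag" value is missing
        # A missing previous end, or a previous end more than one day before
        # this start, begins a new group (the CASE ... ELSE 1 branch in SQL).
        if prev_end is None or prev_end < start - timedelta(days=1):
            grp += 1  # running sum of the "new group" flags
        lo, hi = result.get((id_, grp), (start, end))
        result[(id_, grp)] = (min(lo, start), max(hi, end))
        prev_id, prev_end = id_, end
    return result


blocks = group_ranges(ranges)
for (id_, grp), (first_start, last_end) in sorted(blocks.items()):
    print(id_, grp, first_start, last_end)
```

For this data the first three rows fall into group 1 and the last two into group 2, because the jump from 3/31 to 5/1 is the only place where the previous end date is more than one day before the next start date.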

GROUP BY DAY, CUMULATIVE SUM

I have a table in MSSQL with the following structure:
PersonId
StartDate
EndDate
I need to be able to show the number of distinct people in the table within a date range or at a given date.
As an example, I need to show the running totals on a daily basis, e.g. if we have 2 entries on the 1st June, 3 on the 2nd June and 1 on the 3rd June, the system should show the following result:
1st June: 2
2nd June: 5
3rd June: 6
If, however, one of the entries on the 2nd June also has an end date of 2nd June, then the 3rd June result would show just 5.
Would someone be able to assist with this.
Thanks
UPDATE
This is what I have so far, which seems to work. Is there a better solution, though? My solution only gets me the employed figures. I also need the unemployed figures in another column; unemployed would mean either no entry in the table, or the date not falling within any employment entry.
CREATE TABLE #Temp(CountTotal int NOT NULL, CountDate datetime NOT NULL);
DECLARE @StartDT DATETIME
SET @StartDT = '2015-01-01 00:00:00'
WHILE @StartDT < '2015-08-31 00:00:00'
BEGIN
INSERT INTO #Temp(CountTotal, CountDate)
SELECT COUNT(DISTINCT PERSON.Id) AS CountTotal, @StartDT AS CountDate FROM PERSON
INNER JOIN DATA_INPUT_CHANGE_LOG ON PERSON.DataInputTypeId = DATA_INPUT_CHANGE_LOG.DataInputTypeId AND PERSON.Id = DATA_INPUT_CHANGE_LOG.DataItemId
LEFT OUTER JOIN PERSON_EMPLOYMENT ON PERSON.Id = PERSON_EMPLOYMENT.PersonId
WHERE PERSON.Id > 0 AND DATA_INPUT_CHANGE_LOG.Hidden = '0' AND DATA_INPUT_CHANGE_LOG.Approved = '1'
AND ((PERSON_EMPLOYMENT.StartDate <= DATEADD(MONTH,1,@StartDT) AND PERSON_EMPLOYMENT.EndDate IS NULL)
OR (@StartDT BETWEEN PERSON_EMPLOYMENT.StartDate AND PERSON_EMPLOYMENT.EndDate) AND PERSON_EMPLOYMENT.EndDate IS NOT NULL)
SET @StartDT = DATEADD(MONTH,1,@StartDT)
END
select * from #Temp
drop TABLE #Temp
You can use the following query. The cte part is to generate a set of serial dates between the start date and end date.
DECLARE @ViewStartDate DATETIME
DECLARE @ViewEndDate DATETIME
SET @ViewStartDate = '2015-01-01 00:00:00.000';
SET @ViewEndDate = '2015-02-25 00:00:00.000';
;WITH Dates([Date])
AS
(
    SELECT @ViewStartDate
    UNION ALL
    SELECT DATEADD(DAY, 1, [Date])
    FROM Dates
    WHERE DATEADD(DAY, 1, [Date]) <= @ViewEndDate
)
-- COUNT(PersonData.PersonId) counts only matched rows, so days with nobody return 0
SELECT [Date], COUNT(PersonData.PersonId) AS PersonCount
FROM Dates
LEFT JOIN PersonData ON Dates.[Date] >= PersonData.StartDate
    AND Dates.[Date] <= PersonData.EndDate
GROUP BY [Date]
Replace PersonData with your table name.
If the StartDate and EndDate columns can be NULL, you need to add additional conditions to the join.
The query assumes one person has only one record in the same date range.
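The join between a generated date series and the date ranges can be pictured with a short sketch in plain Python. The rows below are hypothetical (the answer does not give sample data), and the function name daily_counts is made up for the example.

```python
from datetime import date, timedelta

# Hypothetical rows: (person_id, start_date, end_date).
person_data = [
    (1, date(2015, 1, 1), date(2015, 1, 3)),
    (2, date(2015, 1, 2), date(2015, 1, 2)),
    (3, date(2015, 1, 2), date(2015, 1, 5)),
]


def daily_counts(rows, view_start, view_end):
    """Count, for every day in the view range, the rows whose range covers it."""
    counts = {}
    day = view_start
    while day <= view_end:  # plays the role of the recursive Dates CTE
        counts[day] = sum(1 for _, start, end in rows if start <= day <= end)
        day += timedelta(days=1)
    return counts


counts = daily_counts(person_data, date(2015, 1, 1), date(2015, 1, 4))
for day, n in counts.items():
    print(day, n)
```

Like the query, this counts rows rather than distinct people, so it relies on the same assumption that a person has at most one record covering any given day.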
You could do this by creating data where every start date is a +1 event and end date is -1 and then calculate a running total on top of that.
For example if your data is something like this
PersonId StartDate EndDate
1 20150101 20150201
2 20150102 20150115
3 20150101
You first create a data set that looks like this:
EventDate ChangeValue
20150101 +2
20150102 +1
20150115 -1
20150201 -1
And if you use running total, you'll get this:
EventDate Total
2015-01-01 2
2015-01-02 3
2015-01-15 2
2015-02-01 1
You can get it with something like this:
select
p.eventdate,
sum(p.changevalue) over (order by p.eventdate asc) as total
from
(
select startdate as eventdate, sum(1) as changevalue from personnel group by startdate
union all
select enddate, sum(-1) from personnel where enddate is not null group by enddate
) p
order by p.eventdate asc
Having window function with sum() requires SQL Server 2012. If you're using older version, you can check other options for running totals.
My example in SQL Fiddle
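The same +1/-1 bookkeeping can be sketched in a few lines of plain Python, using the three example rows from above; None stands for a missing end date, and the variable names are only illustrative.

```python
from collections import Counter
from itertools import accumulate

# (person_id, start_date, end_date); None means the end date is missing.
personnel = [
    (1, "20150101", "20150201"),
    (2, "20150102", "20150115"),
    (3, "20150101", None),
]

# Every start date is a +1 event and every end date is a -1 event.
changes = Counter()
for _, start, end in personnel:
    changes[start] += 1
    if end is not None:
        changes[end] -= 1

# Running total over the events in date order.
event_dates = sorted(changes)
totals = list(accumulate(changes[d] for d in event_dates))
running = list(zip(event_dates, totals))
for event_date, total in running:
    print(event_date, total)
```

The printed totals are 2, 3, 2 and 1, matching the running-total table shown earlier.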
If you have dates that don't have any events and you need to show those too, then the best option is probably to create a separate table of dates for the whole range you'll ever need, for example 1.1.2000 - 31.12.2099.
-- Edit --
To get the count for a specific day, it's possible to use the same logic, but just sum everything up to that day:
declare @eventdate date
set @eventdate = '20150117'
select
sum(p.changevalue)
from
(
select startdate as eventdate, 1 as changevalue from personnel
where startdate <= @eventdate
union all
select enddate, -1 from personnel
where enddate < @eventdate
) p
Hopefully this is ok, can't test since SQL Fiddle seems to be unavailable.

How to select a specific set of rows and then select next row relevant to the original row

I don't know if what I'm looking for is possible with my current dataset, or if what I'm expecting is possible at all.
What I am trying to accomplish is to get all rows with status = 2 or 7, take their date, and then find the next row with a different status, so that I can compute the date interval and get the number of days the status lasted.
DataSet
id_compromiso|fecha |id_actividad|status
-------------+-----------+------------+----------
32 2013-12-10 359 2
32 2013-12-16 380 5
32 2013-12-18 401 7
32 2013-12-24 485 8
58 2013-12-02 248 2
58 2013-12-03 254 2
58 2013-12-10 360 2
58 2013-12-10 378 5
58 2013-12-12 395 2
What I have tried:
SQL query:
WITH pausa AS (
SELECT tmp.id_compromiso, tmp.fecha, MIN(tact.id_actividad) as id_actividad
FROM Actividades as tact
INNER JOIN (
SELECT act.id_compromiso, CAST(act.fecha as date) as fecha
FROM actividades as act
WHERE act.[status]=7
) as tmp
ON(tmp.id_compromiso = tact.id_compromiso AND tmp.fecha = CAST(tact.fecha as date))
WHERE tact.[status]=7
GROUP BY tmp.id_compromiso, tmp.fecha
),
revision AS (
SELECT tmp.id_compromiso, tmp.fecha, MIN(tact.id_actividad) as id_actividad
FROM Actividades as tact
INNER JOIN (
SELECT act.id_compromiso, CAST(act.fecha as date) as fecha
FROM actividades as act
WHERE act.[status]=2
) as tmp
ON(tmp.id_compromiso = tact.id_compromiso AND tmp.fecha = CAST(tact.fecha as date))
WHERE tact.[status]=2
GROUP BY tmp.id_compromiso, tmp.fecha
)
SELECT * FROM revision ORDER BY id_compromiso;
but I'm really running out of ideas on how to get the next item with a different status from the table ...
-- First, it extends actividades to include the minimum fecha for the status
-- on the compromiso; this is min(fecha) in the partition by compromiso/status
WITH status_start AS(
SELECT *, MIN(fecha) OVER (PARTITION BY id_compromiso, status) sStart
FROM actividades
),
-- Then, join the extended actividades table with itself (aliased a and b) by compromiso but status 2,7 with status not 2,7
-- (this is the AND a.STATUS IN (2,7) AND b.STATUS NOT IN(2,7) in the join clause)
-- and making sure it's a later status (the a.sStart <b.sStart bit)
-- at this point also calculates the date difference in days
status_start_end AS(
SELECT a.*,b.sStart sEnd, DATEDIFF(d, a.sStart, b.sStart) AS sDiff FROM status_start a
JOIN status_start b ON (a.id_compromiso =b.id_compromiso AND a.STATUS IN (2,7) AND b.STATUS NOT IN(2,7) AND a.sStart <b.sStart))
-- Finally, as the previous query would have the day difference in relation to ALL later statuses, we need to select only the minimum difference
-- as this is when the status actually changes. We also need to eliminate duplicates using 'distinct',
-- as it could be many entries for the same status and
-- also many later status.
SELECT DISTINCT id_compromiso, status ,
MIN(sDiff) OVER (PARTITION BY id_compromiso) "Nr. of days in status"
FROM status_start_end
Without knowing more about the context in question it's difficult to provide a fitting answer, but something like this may help:
SELECT TOP 1 id_compromiso, fecha, id_actividad, status
FROM Actividades
WHERE CAST(fecha AS DATE)>( SELECT MAX(CAST(fecha AS DATE))
FROM Actividades
WHERE status IN (2,7))
AND status NOT IN (2,7)
ORDER BY CAST(fecha AS DATE) DESC
I have set up a SQL Fiddle here.

Group by on Postgresql Date Time

Hi. There are employee records in my PostgreSQL database, something like
CODE DATE COUNT
"3443" "2009-04-02" 3
"3444" "2009-04-06" 1
"3443" "2009-04-06" 1
"3443" "2009-04-07" 7
I want a query that selects all codes and counts how often each one occurred in the month.
RESULT:
CODE DATE COUNT
"3443" "2009-04" 3
"3441" "2009-04" 13
"3442" "2009-04" 11
"3445" "2009-04" 72
I did use a query i.e.
SELECT CODE,date_part('month',DATE),count(CODE)
FROM employee
where
group by CODE,DATE
The above query runs fine, but the months in the result are shown as numbers, and it is hard to tell which year a month belongs to. In short, I want to get the result as shown in the RESULT section above. Thanks
Try this:
SELECT CODE, to_char(DATE, 'YYYY-MM'), count(CODE)
FROM employee
group by CODE, to_char(DATE, 'YYYY-MM')
Depending on whether you want the result as text or a date, you can also write it like this:
SELECT CODE, date_trunc('month', DATE), COUNT(*)
FROM employee
GROUP BY CODE, date_trunc('month', DATE);
Which in your example would return this, with DATE still a timestamp, which can be useful if you are going to do further calculations on it since no conversions are necessary:
CODE DATE COUNT
"3443" "2009-04-01" 3
"3441" "2009-04-01" 13
"3442" "2009-04-01" 11
"3445" "2009-04-01" 72
date_trunc() also accepts other values, for instance quarter, year etc.
See the documentation for all values
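The difference between truncating to the month and formatting as text can be sketched in plain Python with the four sample rows from the question. Here date.replace(day=1) is used as a stand-in for date_trunc('month', ...) and strftime("%Y-%m") as a stand-in for to_char(..., 'YYYY-MM'); those are analogies, not the PostgreSQL functions themselves.

```python
from collections import Counter
from datetime import date

# (code, date) pairs from the question's sample rows.
records = [
    ("3443", date(2009, 4, 2)),
    ("3444", date(2009, 4, 6)),
    ("3443", date(2009, 4, 6)),
    ("3443", date(2009, 4, 7)),
]

# Group by code and first day of the month, counting rows per group.
by_month = Counter((code, d.replace(day=1)) for code, d in records)

for (code, month_start), n in sorted(by_month.items()):
    # month_start is still a date, so it can be formatted or compared later.
    print(code, month_start.strftime("%Y-%m"), n)
```

Keeping the truncated value as a date until the final print mirrors the point made above: further calculations stay possible without converting back from text.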
Try any of
SELECT CODE,count(CODE),
DATE as date_normal,
date_part('year', DATE) as year,
date_part('month', DATE) as month,
to_timestamp(
date_part('year', DATE)::text
|| date_part('month', DATE)::text, 'YYYYMM')
as date_month
FROM employee
group by CODE,DATE;
