I have been tasked with building a history table in SQL. I have already built the base table, which contains multiple left joins amongst other things. The base table needs to be compared to another table so that only the specific columns that have changed are updated, and new rows are inserted where the key doesn't match.
Previously I have used other ETL tools with GUI-style built-in SCD loaders, but I don't have that luxury in SQL Server, where the MERGE statement can handle such operations. I have used the MERGE statement before, but I get a bit stuck when handling flags and date fields based on the operation performed.
Here is the BASE table
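+-----+--------+----------+-------------+----------+-------------+---------------------+
| KEY | CLIENT | QUANTITY | CONTRACT_NO | FC_COUNT | DELETE_FLAG | RECORD_UPDATED_DATE |
+-----+--------+----------+-------------+----------+-------------+---------------------+
| 345 | A      | 1000     | 5015        | 1        | N           | 31/12/9999          |
| 346 | B      | 2000     | 9352        | 1        | N           | 31/12/9999          |
| 347 | C      | 3000     | 6903        | 1        | N           | 31/12/9999          |
| 348 | D      | 1000     | 7085        | 1        | N           | 31/12/9999          |
| 349 | E      | 1000     | 8488        | 1        | N           | 31/12/9999          |
| 350 | F      | 500      | 6254        | 1        | N           | 31/12/9999          |
+-----+--------+----------+-------------+----------+-------------+---------------------+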
Here is the table I plan to merge with
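+-----+--------+----------+-------------+----------+
| KEY | CLIENT | QUANTITY | CONTRACT_NO | FC_COUNT |
+-----+--------+----------+-------------+----------+
| 345 | A      | 1299     | 5015        | 1        |
| 346 | B      | 2011     | 9352        | 1        |
| 351 | Z      | 5987     | 5541        | 1        |
+-----+--------+----------+-------------+----------+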
The results I'm looking for are
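+-----+--------+----------+-------------+----------+-------------+---------------------+
| KEY | CLIENT | QUANTITY | CONTRACT_NO | FC_COUNT | DELETE_FLAG | RECORD_UPDATED_DATE |
+-----+--------+----------+-------------+----------+-------------+---------------------+
| 345 | A      | 1000     | 5015        | 1        | N           | 06/07/2022          |
| 345 | A      | 1299     | 5015        | 1        | N           | 31/12/9999          |
| 346 | B      | 2000     | 9352        | 1        | N           | 06/07/2022          |
| 346 | B      | 2011     | 9352        | 1        | N           | 31/12/9999          |
| 347 | C      | 3000     | 6903        | 1        | Y           | 06/07/2022          |
| 348 | D      | 1000     | 7085        | 1        | Y           | 06/07/2022          |
| 349 | E      | 1000     | 8488        | 1        | Y           | 06/07/2022          |
| 350 | F      | 500      | 6254        | 1        | Y           | 06/07/2022          |
| 351 | Z      | 5987     | 5541        | 1        | N           | 31/12/9999          |
+-----+--------+----------+-------------+----------+-------------+---------------------+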
As we can see, I have shown the changes: the old records are closed off with a date, records that were there previously but are now missing are marked with a date and a delete flag, and there is a new row with the new key and data.
Would this be a MERGE? Some direction on how to perform this sort of operation would be a great help. We have a lot of tables where we need to keep change history and this will help a lot going forward.
My attempt at a code shell:
SELECT
MAIN_KEY,
CLIENT,
QUANTITY,
CONTRACT_NO,
1 AS FC_COUNT,
NULL as DELETE_FLG_DD,
GETDATE() as RECORD_UPDATED_DATE
INTO #G1_DELTA
FROM
[dwh].STG_DTL
MERGE [dwh].[PRJ1_DELTA] TARGET
USING #G1_DELTA SOURCE
ON TARGET.MAIN_KEY = SOURCE.MAIN_KEY
WHEN MATCHED THEN INSERT
(
MAIN_KEY,
CLIENT,
QUANTITY,
CONTRACT_NO,
FC_COUNT,
DELETE_FLG_DD,
RECORD_UPDATED_DATE
)
VALUES
(
SOURCE.MAIN_KEY,
SOURCE.CLIENT,
SOURCE.QUANTITY,
SOURCE.CONTRACT_NO,
SOURCE.FC_COUNT,
SOURCE.DELETE_FLG_DD,
SOURCE.RECORD_UPDATED_DATE
)
To build a history table from your two tables, first select the updated rows from each of them.
The changes that need to be applied are on:
"tab1.[DELETE_FLAG]", which should be set to 'N' when the key has a match in tab2 and to 'Y' when it does not
"tab1.[RECORD_UPDATED_DATE]", which should be set to the current date
"tab2.[DELETE_FLAG]", which is missing and should be initialized to 'N'
"tab2.[RECORD_UPDATED_DATE]", which is missing and should be initialized to your placeholder date 9999-12-31.
Once these changes are made, you can apply UNION ALL to combine the rows from the two tables.
Then wrap the combined result set in a CTE and use the INTO <table> clause of a SELECT to create your "history" table.
WITH cte AS (
SELECT tab1.[KEY],
tab1.[CLIENT],
tab1.[QUANTITY],
tab1.[CONTRACT_NO],
tab1.[FC_COUNT],
CASE WHEN tab2.[KEY] IS NOT NULL
THEN 'N'
ELSE 'Y'
END AS [DELETE_FLAG],
CAST(GETDATE() AS DATE) AS [RECORD_UPDATED_DATE]
FROM tab1
LEFT JOIN tab2
ON tab1.[KEY] = tab2.[KEY]
UNION ALL
SELECT *,
'N' AS [DELETE_FLAG],
'9999-12-31' AS [RECORD_UPDATED_DATE]
FROM tab2
)
SELECT *
INTO history
FROM cte
ORDER BY [KEY];
Check the demo here.
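The same LEFT JOIN plus UNION ALL logic can be tried outside SQL Server. The following is only a sketch, assuming SQLite via Python's sqlite3 module instead of T-SQL; the column name main_key (instead of [KEY]) and the fixed "today" value of 2022-07-06 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tab1 (main_key INTEGER, client TEXT, quantity INTEGER,
                   contract_no INTEGER, fc_count INTEGER);
CREATE TABLE tab2 (main_key INTEGER, client TEXT, quantity INTEGER,
                   contract_no INTEGER, fc_count INTEGER);
INSERT INTO tab1 VALUES
  (345,'A',1000,5015,1), (346,'B',2000,9352,1), (347,'C',3000,6903,1),
  (348,'D',1000,7085,1), (349,'E',1000,8488,1), (350,'F',500,6254,1);
INSERT INTO tab2 VALUES
  (345,'A',1299,5015,1), (346,'B',2011,9352,1), (351,'Z',5987,5541,1);
""")

history_sql = """
-- old rows: closed off today, flagged 'Y' only when missing from tab2
SELECT t1.main_key, t1.client, t1.quantity, t1.contract_no, t1.fc_count,
       CASE WHEN t2.main_key IS NOT NULL THEN 'N' ELSE 'Y' END,
       :today
FROM tab1 AS t1
LEFT JOIN tab2 AS t2 ON t1.main_key = t2.main_key
UNION ALL
-- new rows: open-ended placeholder date, never flagged as deleted
SELECT main_key, client, quantity, contract_no, fc_count, 'N', '9999-12-31'
FROM tab2
ORDER BY 1, 7
"""

# "today" is passed in as a fixed value so the output is reproducible
rows = conn.execute(history_sql, {"today": "2022-07-06"}).fetchall()
for row in rows:
    print(row)
```

With the sample data from the question this prints nine rows, matching the expected result table (dates shown in ISO format).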
In the example below (Table A), we have entries for four different IDs (1, 2, 3, 4) with their respective statuses and times. I want to find the ID that took the longest to change its Status from Started to Completed; in the example below it is ID = 4. The table currently holds approximately a million records, so it would be great if someone could suggest an efficient way to retrieve this data.
Table A
ID Status Date(YYYY-DD-MM HH:MM:SS)
1. Started 2017-01-01 01:00:00
1. Completed 2017-01-01 02:00:00
2. Started 2017-10-02 03:00:00
2. Completed 2017-10-02 05:00:00
3. Started 2017-15-03 06:00:00
3. Completed 2017-15-03 09:00:00
4. Started 2017-22-04 10:00:00
4. Completed 2017-22-04 15:00:00
Thanks!
Bruce
You can query as below:
Select top 1 with ties y1.Id from #yourDate y1
join #yourDate y2
On y1.Id = y2.Id
and y1.[STatus] = 'Started'
and y2.[STatus] = 'Completed'
order by Row_number() over(order by datediff(mi,y1.[Date], y2.[date]) desc)
SELECT
started.ID, timediff(completed.date, started.date) as elapsed_time
FROM TABLE_A as started
INNER JOIN TABLE_A as completed ON (completed.ID=started.ID AND completed.status='Completed')
WHERE started.status='Started'
ORDER BY elapsed_time desc
Be sure there is an index on TABLE_A covering the columns ID and date.
I haven't run this sql but it may solve your problem.
select a.id, max(DATEDIFF(SECOND, a.date, b.date)) as elapsed_seconds from TableA as a
join TableA as b on a.id = b.id
where a.status = 'Started' and b.status = 'Completed'
group by a.id
Here's a way with a derived table (a grouped sub-query joined on ID). Just uncomment the TOP 1 to get ID 4 in this case. This is based on your comments that there is only one "started" record, but there could be multiple "completed" records for each ID.
declare @TableA table (ID int, Status varchar(64), Date datetime)
insert into @TableA
values
(1,'Started','2017-01-01 01:00:00'),
(1,'Completed','2017-01-01 02:00:00'),
(2,'Started','2017-02-10 03:00:00'),
(2,'Completed','2017-02-10 05:00:00'),
(3,'Started','2017-03-15 06:00:00'),
(3,'Completed','2017-03-15 09:00:00'),
(4,'Started','2017-04-22 10:00:00'),
(4,'Completed','2017-04-22 15:00:00')
select --top 1
s.ID
,datediff(minute,s.Date,e.EndDate) as TimeDifference
from @TableA s
inner join(
select
ID
,max(Date) as EndDate
from @TableA
where Status = 'Completed'
group by ID) e on e.ID = s.ID
where
s.Status = 'Started'
order by
datediff(minute,s.Date,e.EndDate) desc
RETURNS
+----+----------------+
| ID | TimeDifference |
+----+----------------+
| 4 | 300 |
| 3 | 180 |
| 2 | 120 |
| 1 | 60 |
+----+----------------+
If you know that 'started' will always be the earliest point in time for each ID and the last 'completed' record you are considering will always be the latest point in time for each ID, the following should have good performance for a large number of records:
SELECT TOP 1
id
, DATEDIFF(s, MIN([Date]), MAX([date])) AS Elapsed
FROM #TableA
GROUP BY ID
ORDER BY DATEDIFF(s, MIN([Date]), MAX([date])) DESC
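To try the MIN/MAX grouping idea without a SQL Server instance, here is a rough sketch that assumes SQLite through Python's sqlite3 module. The table name table_a and column name ts are made up for the example, and strftime('%s', ...) stands in for DATEDIFF, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_a (id INTEGER, status TEXT, ts TEXT)")
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?)",
    [
        (1, "Started", "2017-01-01 01:00:00"),
        (1, "Completed", "2017-01-01 02:00:00"),
        (2, "Started", "2017-02-10 03:00:00"),
        (2, "Completed", "2017-02-10 05:00:00"),
        (3, "Started", "2017-03-15 06:00:00"),
        (3, "Completed", "2017-03-15 09:00:00"),
        (4, "Started", "2017-04-22 10:00:00"),
        (4, "Completed", "2017-04-22 15:00:00"),
    ],
)

# Earliest and latest timestamp per id, converted to Unix seconds.
# This relies on 'Started' being the earliest row for each id.
slowest = conn.execute("""
    SELECT id,
           CAST(strftime('%s', MAX(ts)) AS INTEGER)
         - CAST(strftime('%s', MIN(ts)) AS INTEGER) AS elapsed_seconds
    FROM table_a
    GROUP BY id
    ORDER BY elapsed_seconds DESC
    LIMIT 1
""").fetchone()

print(slowest)  # (4, 18000)
```

The single pass with GROUP BY avoids the self-join, which is why it tends to scale well when the ordering assumption holds.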
In our company, our clients perform various activities that we log in different tables - Interview attendance, Course Attendance, and other general activities.
I have a database view that unions data from all of these tables giving us the ActivityView that looks like this.
As you can see some activities overlap - for example while attending an interview, a client may have been performing a CV update activity.
+----------------------+---------------+---------------------+-------------------+
| activity_client_id | activity_type | activity_start_date | activity_end_date |
+----------------------+---------------+---------------------+-------------------+
| 112 | Interview | 2015-06-01 09:00 | 2015-06-01 11:00 |
| 112 | CV updating | 2015-06-01 09:30 | 2015-06-01 11:30 |
| 112 | Course | 2015-06-02 09:00 | 2015-06-02 16:00 |
| 112 | Interview | 2015-06-03 09:00 | 2015-06-03 10:00 |
+----------------------+---------------+---------------------+-------------------+
Each client has a "Sign Up Date", recorded on the client table, which is when they joined our programme. Here it is for our sample client:
+-----------+---------------------+
| client_id | client_sign_up_date |
+-----------+---------------------+
| 112 | 2015-05-20 |
+-----------+---------------------+
I need to create a report that will show the following columns:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
We need this report in order to see how effective our programme is. An important aim of the programme is that we get every client to complete at least 5 hours of activity as quickly as possible.
So this report will tell us how long from sign up does it take each client to achieve this figure.
What makes this even trickier is that when we calculate 5 hours of total activity, we must discount overlapping activities:
In the sample data above the client attended an interview between 09:00 and 11:00.
On the same day they also performed CV updating activity from 09:30 to 11:30.
For our calculation, this would give them total activity for the day of 2.5 hours (150 minutes) - we would only count 30 minutes of the CV updating as the Interview overlaps it up to 11:00.
So the report for our sample client would give the following result:
+-----------+---------------------+--------------------------------------------+
| client_id | client_sign_up_date | date_client_completed_5_hours_of_activity |
+-----------+---------------------+--------------------------------------------+
| 112 | 2015-05-20 | 2015-06-02 |
+-----------+---------------------+--------------------------------------------+
So my question is how can I create the report using a select statement ?
I can work out how to do this by writing a stored procedure that will loop through the view and write the result to a report table.
But I would much prefer to avoid a stored procedure and have a select statement that will give me the report on the fly.
I am using SQL Server 2005.
See SQL Fiddle here.
with tbl as (
-- this merges overlapping time ranges for each id
select distinct
a.id
,(
select min(x.starttime)
from act x
where x.id=a.id and ( x.starttime between a.starttime and a.endtime
or a.starttime between x.starttime and x.endtime )
) start1
,(
select max(x.endtime)
from act x
where x.id=a.id and ( x.endtime between a.starttime and a.endtime
or a.endtime between x.starttime and x.endtime )
) end1
from act a
), tbl2 as
(
-- this will add minute and total minute column
select
*
,datediff(mi,t.start1,t.end1) mi
,(select sum(datediff(mi,x.start1,x.end1)) from tbl x where x.id=t.id and x.end1<=t.end1) totalmi
from tbl t
), tbl3 as
(
-- final step: the start time and the end time at which 5 hours (300 minutes) were reached; endtime is null if 5 hours were never completed
select
t.id
,min(t.start1) starttime
,min(case when t.totalmi>300 then t.end1 else null end) endtime
from tbl2 t
group by t.id
)
-- final result
select *
from tbl3
where endtime is not null
This is one way to do it:
;WITH CTErn AS (
SELECT activity_client_id, activity_type,
activity_start_date, activity_end_date,
ROW_NUMBER() OVER (PARTITION BY activity_client_id
ORDER BY activity_start_date) AS rn
FROM activities
),
CTEdiff AS (
SELECT c1.activity_client_id, c1.activity_type,
x.activity_start_date, c1.activity_end_date,
DATEDIFF(mi, x.activity_start_date, c1.activity_end_date) AS diff,
ROW_NUMBER() OVER (PARTITION BY c1.activity_client_id
ORDER BY x.activity_start_date) AS seq
FROM CTErn AS c1
LEFT JOIN CTErn AS c2 ON c1.rn = c2.rn + 1
CROSS APPLY (SELECT CASE
WHEN c1.activity_start_date < c2.activity_end_date
THEN c2.activity_end_date
ELSE c1.activity_start_date
END) x(activity_start_date)
)
SELECT TOP 1 client_id, client_sign_up_date, activity_start_date,
hoursOfActivicty
FROM CTEdiff AS c1
INNER JOIN clients AS c2 ON c1.activity_client_id = c2.client_id
CROSS APPLY (SELECT SUM(diff) / 60.0
FROM CTEdiff AS c3
WHERE c3.seq <= c1.seq) x(hoursOfActivicty)
WHERE hoursOfActivicty >= 5
ORDER BY seq
Common Table Expressions and ROW_NUMBER() were introduced with SQL Server 2005, so the above query should work for that version.
Demo here
The first CTE, i.e. CTErn, produces the following output:
client_id activity_type start_date end_date rn
112 Interview 2015-06-01 09:00 2015-06-01 11:00 1
112 CV updating 2015-06-01 09:30 2015-06-01 11:30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 4
The second CTE, i.e. CTEdiff, uses the above table expression to calculate the time difference for each record, taking into consideration any overlap with the previous record:
client_id activity_type start_date end_date diff seq
112 Interview 2015-06-01 09:00 2015-06-01 11:00 120 1
112 CV updating 2015-06-01 11:00 2015-06-01 11:30 30 2
112 Course 2015-06-02 09:00 2015-06-02 16:00 420 3
112 Interview 2015-06-03 09:00 2015-06-03 10:00 60 4
The final query calculates the cumulative sum of time difference and selects the first record that exceeds 5 hours of activity.
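The above query will work for simple interval overlaps, i.e. when just the end date of an activity overlaps the start date of the next activity. As written, it also handles a single client only: the self-join and the cumulative sum do not match on activity_client_id, and TOP 1 returns one row.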
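For comparison, the overlap handling can be illustrated outside SQL. This is a minimal sketch in Python rather than T-SQL, and not the query above: it sorts one client's activities by start time, clips each one against the time already counted, and stops when the running total reaches the goal. The function name time_goal_reached is hypothetical.

```python
from datetime import datetime, timedelta


def time_goal_reached(intervals, goal=timedelta(hours=5)):
    """Return the moment the non-overlapping activity total reaches goal, or None."""
    total = timedelta(0)
    covered_until = None  # end of the time already counted
    for start, end in sorted(intervals):
        # Clip away any part that overlaps time counted earlier.
        if covered_until is not None and start < covered_until:
            start = covered_until
        if end <= start:
            continue  # entirely inside time that was already counted
        if total + (end - start) >= goal:
            return start + (goal - total)
        total += end - start
        covered_until = end
    return None


# Activities for sample client 112 from the question
activities = [
    (datetime(2015, 6, 1, 9, 0), datetime(2015, 6, 1, 11, 0)),   # Interview
    (datetime(2015, 6, 1, 9, 30), datetime(2015, 6, 1, 11, 30)), # CV updating
    (datetime(2015, 6, 2, 9, 0), datetime(2015, 6, 2, 16, 0)),   # Course
    (datetime(2015, 6, 3, 9, 0), datetime(2015, 6, 3, 10, 0)),   # Interview
]

reached = time_goal_reached(activities)
print(reached)  # 2015-06-02 11:30:00
```

Because each interval is clipped against everything counted so far, this also copes with an activity that sits entirely inside an earlier one, a case the simple previous-row comparison does not cover.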
A Geometric Approach
For another issue, I've taken a geometric approach to date
packing. Namely, I convert dates and times to a sql geometry
type and utilize geometry::UnionAggregate to merge the ranges.
I don't believe this will work in sql-server 2005. But your
problem was such an interesting puzzle that I wanted to see
whether the geometrical approach would work. So any future
users running into this problem that have access to a later
version can consider it.
Code Description
In 'numbers':
I build a table representing a sequence
Swap it out with your favorite way to make a numbers table.
For a union operation, you won't ever need more rows than in
your original table, so I just use it as the base to build it.
In 'mergeLines':
I convert the dates to floats and use those floats
to create geometrical points.
I then connect these points via STUnion and STEnvelope.
Finally, I merge all these lines via UnionAggregate. The resulting
'lines' geometry object might contain multiple lines, but if they
overlap, they turn into one line.
In 'redate':
I use the numbers CTE to extract the individual lines inside 'lines'.
I envelope the lines which here ensures that the lines are stored
only as its two endpoints.
I read the endpoint x values and convert them back to their time
representations (This is usually the end goal, but you need more).
I calculate the difference in minutes between activity start and
end dates (I do this first in seconds then divide by 60 for the
sake of a precision issue).
I calculate the cumulative sum of these minutes for each row.
In the outer query:
I align the previous cumulative minutes sum with each current row
I filter for the row where the 5hr goal was met but where the
previous minutes shows that the 5hr goal for the previous row
was not met.
I then calculate where in the current row's range the user has
met the 5 hours, to not only arrive at the date the five hour
goal was met, but the exact time.
The Code
with
numbers as (
select row_number() over (order by (select null)) i
from #activities -- where I put your data
),
mergeLines as (
select activity_client_id,
lines = geometry::UnionAggregate(line)
from #activities
cross apply (select
startP = geometry::Point(convert(float,activity_start_date), 0, 0),
stopP = geometry::Point(convert(float,activity_end_date), 0, 0)
) pointify
cross apply (select line = startP.STUnion(stopP).STEnvelope()) lineify
group by activity_client_id
),
redate as (
select client_id = activity_client_id,
activities_start_date,
activities_end_date,
minutes,
rollingMinutes = sum(minutes) over(
partition by activity_client_id
order by activities_start_date
rows between unbounded preceding and current row
)
from mergeLines ml
join numbers n on n.i between 1 and ml.lines.STNumGeometries()
cross apply (select line = ml.lines.STGeometryN(i).STEnvelope()) l
cross apply (select
activities_start_date = convert(datetime, l.line.STPointN(1).STX),
activities_end_date = convert(datetime, l.line.STPointN(3).STX)
) unprepare
cross apply (select minutes =
round(datediff(s, activities_start_date, activities_end_date) / 60.0,0)
) duration
)
select client_id,
activities_start_date,
activities_end_date,
met_5hr_goal = dateadd(minute, (60 * 5) - prevRoll, activities_start_date)
from (
select *,
prevRoll = lag(rollingMinutes) over (
partition by client_id
order by rollingMinutes
)
from redate
) ranker
where rollingMinutes >= 60 * 5
and prevRoll < 60 * 5;
I asked a similar question here and got help from jpw, who helped me with the query. The situation remains the same, with a bit more detail added. I have four tables. The sample structure for three of them is given below:
I was helped to form the query below:
select
d.LOTQty,
ApprovedQty = count(d.SerialNo),
d.DispatchDate,
Installed = count(a.SerialNo) + count(r.SerialNo)
from
Despatch d
left join
Activation a
on d.SerialNo= a.SerialNo
and d.DispatchDate <= a.ActivationDate
and d.LOTQty = a.LOTQty
left join
Replaced r
on d.SerialNo= r.SerialNo
and d.DispatchDate <= r.ActivationDate
and (a.ActivationDate is null or a.ActivationDate < d.DispatchDate)
where
d.LOTQty = 15
group by
d.LOTQty, d.DispatchDate, d.STBModel
To explain, the query above matches the Despatch table's SerialNo against the Activation table. If a match is found, it compares the dates: only serial numbers with DispatchDate <= ActivationDate are counted, while the others (which didn't match or whose DispatchDate > ActivationDate) are matched against Replaced using a similar date criterion. In the end we find 9 matches, i.e. 7 from Activation and 2 from Replaced, as below:
LotQty | ApprovedQty | DispatchDate | Installed
15 | 10 | 2013-8-7 | 9
I want to display two more columns in here i.e DOA and Bounce like this:
LotQty | ApprovedQty | DispatchDate | Installed | DOA | Bounce
15 | 10 | 2013-8-7 | 9 | 2 | 4
DOA and Bounce should be calculated from the difference between the FailedDate in the 4th table, i.e. the Failed table, and the respective Activation/Record date of the above 9 matched SerialNos (henceforth termed act_rec_date). The structure of the Failed table and of the intermediate 9 matched SerialNos is shown below:
The Intermediate table doesn't physically exist; it is shown only for reference and clarity. It contains the SerialNos that were matched with the Activation and Replaced tables. The act_rec_Date field is the correspondingly matched Activation/Record date.
DOA & Bounce = We should match all 9 resultant SerialNos (i.e. the Intermediate table) with the Failed table. If matched, calculate the difference between FailedDate and act_rec_date. If the difference is (0 to <=10 days), count it under DOA, and if the difference is (>10 days to <=180 days), count it under Bounce. From Failed we find 6 matches, of which Product1,2 fall under DOA as their difference from act_rec_Date is 0, and Product7,8,9 & 10 fall under Bounce as their differences are 89 | 54 | 61 | 61. So, as shown above, DOA = 2 and Bounce = 4.
I want to build a query that gives me DOA and Bounce as well. I tried creating a temp table and dumping the resultant SerialNos and act_rec_Date into it, then matching the temp table against the Failed table. I couldn't get it working, and furthermore the query took around 7 minutes to execute.
P.S- My Actual tables contain around 50k to 100k data entries.
Continuing on the previous query I think the new columns could be added with a conditional aggregation in the select statement and another left join for the failed table.
This should work, but I'm sure the query can be improved:
select
d.LOTQty,
ApprovedQty = count(d.SerialNo),
d.DispatchDate,
Installed = count(a.SerialNo) + count(r.NewSerialNo),
DOA = sum(case when datediff(day, coalesce(a.ActivationDate,r.RecordDate), f.FailedDate) <= 10 then 1 else 0 end),
Bounce = sum(case when datediff(day, coalesce(a.ActivationDate,r.RecordDate), f.FailedDate) between 11 and 180 then 1 else 0 end)
from
Despatch d
left join
Activation a
on d.SerialNo= a.SerialNo
and d.DispatchDate <= a.ActivationDate
and d.LOTQty = a.LOTQty
left join
Replaced r
on d.SerialNo= r.NewSerialNo
and d.DispatchDate <= r.RecordDate
and (a.ActivationDate is null or a.ActivationDate < d.DispatchDate)
left join
Failed f
on (f.FailedSINo = a.SerialNo)
or (f.FailedSINo = r.NewSerialNo)
where
d.LOTQty = 15
group by
d.LOTQty, d.DispatchDate
Sample SQL Fiddle with test data
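The conditional aggregation used for DOA and Bounce can be seen in isolation in the sketch below. It assumes SQLite via Python's sqlite3 module rather than SQL Server, uses julianday differences in place of DATEDIFF, and runs on made-up rows: the installed and failed tables and their dates are purely illustrative, not the data from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical stand-in for the 'Intermediate' result (matched serial numbers)
CREATE TABLE installed (serial_no TEXT, act_rec_date TEXT);
CREATE TABLE failed (serial_no TEXT, failed_date TEXT);
INSERT INTO installed VALUES
  ('P1', '2013-08-10'), ('P2', '2013-08-12'),
  ('P3', '2013-08-15'), ('P7', '2013-08-20');
INSERT INTO failed VALUES
  ('P1', '2013-08-10'),   -- 0 days after activation
  ('P2', '2013-08-20'),   -- 8 days after activation
  ('P7', '2013-11-17');   -- 89 days after activation
""")

counts = conn.execute("""
    SELECT COUNT(i.serial_no) AS installed,
           SUM(CASE WHEN julianday(f.failed_date) - julianday(i.act_rec_date)
                         BETWEEN 0 AND 10 THEN 1 ELSE 0 END) AS doa,
           SUM(CASE WHEN julianday(f.failed_date) - julianday(i.act_rec_date) > 10
                     AND julianday(f.failed_date) - julianday(i.act_rec_date) <= 180
                    THEN 1 ELSE 0 END) AS bounce
    FROM installed AS i
    LEFT JOIN failed AS f ON f.serial_no = i.serial_no
""").fetchone()

print(counts)  # (4, 2, 1)
```

Rows with no failure produce a NULL day difference, so both CASE expressions fall through to 0 and the row still counts towards installed.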
Say I have the following PaymentTransaction table:
ID Amount PayMethodID
----------------------------
10254 100 1
15789 150 1
15790 200 0
16954 300 0
17864 400 1
19364 500 1
PayMethodID Desc
----------------------------
0 CASH
1 VISA
2 MASTER
3 AMEX
4 ETC
I can simply use a GROUP BY to group the transactions by PayMethodID, which gives rows for 1 and 0.
What I am trying to do is also show the PayMethodIDs that do not exist in the transaction table.
My current result with a simple GROUP BY statement is
PayMethodID TotalAmount
-------------------------
0 500
1 1150
Expected result (showing 0 if the PayMethodID does not exist in the transaction table):
PayMethodID TotalAmount
-------------------------
0 500
1 1150
2 0
3 0
4 0
This might be a simple, duplicate question, but I just can't find the right keywords to search for. I will remove this post if you can point me to a duplicate. Thanks.
You can use LEFT JOIN, so all rows from the leftmost table (TableA) will be shown whether or not they have matching rows in the other table.
SELECT a.PayMethodID,
TotalAmount = ISNULL(SUM(b.Amount), 0)
FROM TableA AS a -- <== contains list of card type
LEFT JOIN TableB AS b -- <== contains the payment list
ON a.PayMethodID = b.PayMethodID
GROUP BY a.PayMethodID
A regular OUTER (LEFT) JOIN will give you all rows from the PayMethod table whether or not they exist in the PaymentTransaction table, with the sums for unmatched rows being NULL. You can then use COALESCE to turn those NULL sums into zero:
SELECT pm.PayMethodID, COALESCE(SUM(pt.Amount), 0) TotalAmount
FROM PayMethod pm
LEFT JOIN PaymentTransaction pt
ON pm.PayMethodID = pt.PayMethodID
GROUP BY pm.PayMethodID
An SQLfiddle to test with.
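For a quick local check of the LEFT JOIN plus COALESCE pattern, the sketch below loads the sample data from the question into SQLite through Python's sqlite3 module. SQLite is an assumption made only to keep the example self-contained, and the snake_case table and column names are invented for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pay_method (pay_method_id INTEGER, description TEXT);
CREATE TABLE payment_transaction (id INTEGER, amount INTEGER, pay_method_id INTEGER);
INSERT INTO pay_method VALUES
  (0,'CASH'), (1,'VISA'), (2,'MASTER'), (3,'AMEX'), (4,'ETC');
INSERT INTO payment_transaction VALUES
  (10254,100,1), (15789,150,1), (15790,200,0),
  (16954,300,0), (17864,400,1), (19364,500,1);
""")

# Every pay method appears once; methods without transactions sum to NULL,
# which COALESCE turns into 0.
totals = conn.execute("""
    SELECT pm.pay_method_id, COALESCE(SUM(pt.amount), 0) AS total_amount
    FROM pay_method AS pm
    LEFT JOIN payment_transaction AS pt
           ON pm.pay_method_id = pt.pay_method_id
    GROUP BY pm.pay_method_id
    ORDER BY pm.pay_method_id
""").fetchall()

print(totals)  # [(0, 500), (1, 1150), (2, 0), (3, 0), (4, 0)]
```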