Bigquery - Add full date range to each id

Bigquery - Add full date range to each id - arrays

How can i apply GENERATE_DATE_ARRAY(start_date, end_date[, INTERVAL INT64_expr date_part]) to each record in a dataset. I understand how to apply it to get a single date range from start to end, but don't know how to apply the same date array to each id.
Say i have two distinct ID's x and y with the following dates:
|id|date
--------------
1 |x |2021-01-01
2 |x |2021-01-03
3 |y |2021-01-06
4 |y |2021-01-09
and i want to fill in the date gap for each ID
How can i achieve the following output?
|id|date
--------------
1 |x |2021-01-01
2 |x |2021-01-02
3 |x |2021-01-03
4 |y |2021-01-06
5 |y |2021-01-07
6 |y |2021-01-08
7 |y |2021-01-09

Below is for BigQuery Standard SQL
select id, date from (
select id, date, lead(date) over(partition by id order by date) next_date
from `project.dataset.table`
), unnest(generate_date_array(date, next_date)) date
where not next_date is null
-- order by date
if to apply to sample data from your question - output is

Try the following in standard SQL in BigQuery:
with data as (
select 'x' as id, date '2021-01-01' as date
UNION ALL
select 'x' as id, date '2021-01-03' as date
UNION ALL
select 'y' as id, date '2021-01-06' as date
UNION ALL
select 'y' as id, date '2021-01-09' as date
)
select d1.id, date
from data d1
join data d2
on d1.id = d2.id
and d1.date < d2.date, unnest(GENERATE_DATE_ARRAY(d1.date, d2.date, INTERVAL 1 DAY)) as date;

Related

Count Distinct Persons Per Year, but Only Once

How would I select distinct persons per year, but only count each person once.
An example of my data is:
ID Date
1 20NOV2018
2 06JUN2017
2 29JUL2011
3 05MAY2014
4 04APR2002
4 25APR2009
I want my output to look like:
2002 1
2009 0
2011 1
2014 1
2017 0
2018 1

Use sub-selects to left join the distinct years with the first year an id occurs and count the ids from that.
data have;
input
id date date9.; format date date9.; datalines;
1 20NOV2018
2 06JUN2017
2 29JUL2011
3 05MAY2014
4 04APR2002
4 25APR2009
run;
proc sql;
create table want as
select each.year, count(first.id) as appeared_count
from
( select distinct year(date) as year
from have
) as each
left join
( select id, min(year(date)) as year
from have group by id
) as first
on each.year = first.year
group by each.year
order by each.year
;

Hope this Code Works Fine for Your case:
SELECT YEAR(M.DATE) AS DATE,COUNT(S.ID)
FROM #TAB M
LEFT JOIN (SELECT MIN(YEAR(DATE)) AS DATE ,ID
FROM #TAB GROUP BY ID) S ON YEAR(M.DATE)=S.DATE GROUP BY YEAR(M.DATE) ORDER BY YEAR(M.DATE)

return amount per year/month records based on start and enddate

I have a table with, for example this data:
ID |start_date |end_date |amount
---|------------|-----------|--------
1 |2019-03-21 |2019-05-09 |10000.00
2 |2019-04-02 |2019-04-10 |30000.00
3 |2018-11-01 |2019-01-08 |20000.00
I would like te get the splitted records back with the correct calculated amount based on the year/month.
I expect the outcome to be like this:
ID |month |year |amount
---|------|-------|--------
1 |3 | 2019 | 2200.00
1 |4 | 2019 | 6000.00
1 |5 | 2019 | 1800.00
2 |4 | 2019 |30000.00
3 |11 | 2018 | 8695.65
3 |12 | 2018 | 8985.51
3 |1 | 2019 | 2318.84
What would be the best way to achieve this? I think you would have to use DATEDIFF to get the number of days between the start_date and end_date to calculate the amount per day, but I'm not sure how to return it as records per month/year.
Tnx in advance!

This is one idea. I use a Tally to create a day for every day the amount is relevant for for that ID. Then, I aggregate the value of the Amount divided by the numbers of days, which is grouped by Month and year:
CREATE TABLE dbo.YourTable(ID int,
StartDate date,
EndDate date,
Amount decimal(12,2));
GO
INSERT INTO dbo.YourTable (ID,
StartDate,
EndDate,
Amount)
VALUES(1,'2019-03-21','2019-05-09',10000.00),
(2,'2019-04-02','2019-04-10',30000.00),
(3,'2018-11-01','2019-01-08',20000.00);
GO
--Create a tally
WITH N AS(
SELECT N
FROM (VALUES(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL),(NULL))N(N)),
Tally AS(
SELECT TOP (SELECT MAX(DATEDIFF(DAY, t.StartDate, t.EndDate)+1) FROM dbo.YourTable t) --Limits the rows, might be needed in a large dataset, might not be, remove as required
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) -1 AS I
FROM N N1, N N2, N N3), --1000 days, is that enough?
--Create the dates
Dates AS(
SELECT YT.ID,
DATEADD(DAY, T.I, YT.StartDate) AS [Date],
YT.Amount,
COUNT(T.I) OVER (PARTITION BY YT.ID) AS [Days]
FROM Tally T
JOIN dbo.YourTable YT ON T.I <= DATEDIFF(DAY, YT.StartDate, YT.EndDate))
--And now aggregate
SELECT D.ID,
DATEPART(MONTH,D.[Date]) AS [Month],
DATEPART(YEAR,D.[Date]) AS [Year],
CONVERT(decimal(12,2),SUM(D.Amount / D.[Days])) AS Amount
FROM Dates D
GROUP BY D.ID,
DATEPART(MONTH,D.[Date]),
DATEPART(YEAR,D.[Date])
ORDER BY D.ID,
[Year],
[Month];
GO
DROP TABLE dbo.YourTable;
GO
DB<>Fiddle

Row to Column Group by ID SQL Server

I have a table like this:
id name date
-------------------
1 Adam 2018-10-01
1 Adam 2018-08-01
2 Eve 2018-07-01
2 Eve 2018-05-01
I want it to become like this:
id name firstdate lastdate
--------------------------------
1 Adam 2018-08-01 2018-10-01
2 Eve 2018-05-01 2018-07-01
I tried to use this query but it failed:
SELECT * FROM View_MySource
PIVOT (
MIN(mydate)
FOR id IN ([firstdate], [lastdate])
) piv
I am new to pivot, can someone help me?

I might even avoid using PIVOT in this case:
SELECT
id,
name,
MIN(date) AS firstdate,
MAX(date) AS maxdate
FROM View_MySource
GROUP BY
id,
name;

This has actually nothing to do with pivoting data. All you want to do is get the minimum and maximum date per ID; a simple aggregation:
select id, name, min(date) as firstdate, max(date) as lastdate
from view_mysource
group by id, name
order by id;

Finding max date difference on a single column

in the below table example - Table A, we have entries for four different ID's 1,2,3,4 with the respective status and its time. I wanted to find the "ID" which took the maximum amount of time to change the "Status" from Started to Completed. In the below example it is ID = 4. I wanted to run a query and find the results, where we currently has approximately million records in a table. It would be really great, if someone provide an effective way to retrieve this data.
Table A
ID Status Date(YYYY-DD-MM HH:MM:SS)
1. Started 2017-01-01 01:00:00
1. Completed 2017-01-01 02:00:00
2. Started 2017-10-02 03:00:00
2. Completed 2017-10-02 05:00:00
3. Started 2017-15-03 06:00:00
3. Completed 2017-15-03 09:00:00
4. Started 2017-22-04 10:00:00
4. Completed 2017-22-04 15:00:00
Thanks!
Bruce

You can query as below:
Select top 1 with ties Id from #yourDate y1
join #yourDate y2
On y1.Id = y2.Id
and y1.[STatus] = 'Started'
and y2.[STatus] = 'Completed'
order by Row_number() over(order by datediff(mi,y1.[Date], y2.[date]) desc)

SELECT
started.ID, timediff(completed.date, started.date) as elapsed_time
FROM TABLE_A as started
INNER JOIN TABLE_A as completed ON (completed.ID=started.ID AND completed.status='Completed')
WHERE started.status='Started'
ORDER BY elapsed_time desc
be sure there's a index on TABLE_A for the columns ID, date

I haven't run this sql but it may solve your problem.
select a.id, max(DATEDIFF(SECOND, a.date, b.date + 1)) from TableA as a
join TableA as b on a.id = b.id
where a.status="started" and b.status="completed"

Here's a way with a correlated sub-query. Just uncomment the TOP 1 to get ID 4 in this case. This is based off your comments that there is only 1 "started" record, but could be multiple "completed" records for each ID.
declare #TableA table (ID int, Status varchar(64), Date datetime)
insert into #TableA
values
(1,'Started','2017-01-01 01:00:00'),
(1,'Completed','2017-01-01 02:00:00'),
(2,'Started','2017-02-10 03:00:00'),
(2,'Completed','2017-02-10 05:00:00'),
(3,'Started','2017-03-15 06:00:00'),
(3,'Completed','2017-03-15 09:00:00'),
(4,'Started','2017-04-22 10:00:00'),
(4,'Completed','2017-04-22 15:00:00')
select --top 1
s.ID
,datediff(minute,s.Date,e.EndDate) as TimeDifference
from #TableA s
inner join(
select
ID
,max(Date) as EndDate
from #TableA
where Status = 'Completed'
group by ID) e on e.ID = s.ID
where
s.Status = 'Started'
order by
datediff(minute,s.Date,e.EndDate) desc
RETURNS
+----+----------------+
| ID | TimeDifference |
+----+----------------+
| 4 | 300 |
| 3 | 180 |
| 2 | 120 |
| 1 | 60 |
+----+----------------+

If you know that 'started' will always be the earliest point in time for each ID and the last 'completed' record you are considering will always be the latest point in time for each ID, the following should have good performance for a large number of records:
SELECT TOP 1
id
, DATEDIFF(s, MIN([Date]), MAX([date])) AS Elapsed
FROM #TableA
GROUP BY ID
ORDER BY DATEDIFF(s, MIN([Date]), MAX([date])) DESC

How to join get results per day without looping in stored procedure

I am so sorry. I did not ask the question correctly. So here it is again:
I have a table containing number of posts, their destinations and their retrieval dates. What I want to do is for a date range and for each date in the range, find the latest number of posts for each destination that is earlier than the date. This needs to be done in a stored procedure and not looping through the date range. Is it possible? Thanks.
For example:
Destination RetrievalDate NumberOfPosts
----------- ------------------ -------------
A 11/1/2011 12:00 AM 1
A 11/1/2011 1:00 AM 5
A 11/1/2011 5:00 AM 6
B 11/1/2011 12:00 AM 0
B 11/1/2011 4:00 AM 2
C 10/20/2011 5:00PM 1
A 11/2/2011 12:00 AM 8
A 11/2/2011 2:00 AM 9
B 11/2/2011 12AM 3
For example, in the above table, if the date range is 11/1/2011 - 11/3/2011, I would get
Destination ReportDate NumberOfPosts
----------- ---------- -------------
A 11/1/2011 6
B 11/1/2011 2
C 11/1/2011 1
A 11/2/2011 9
B 11/2/2011 3
C 11/2/2011 1
A 11/3/2011 9
B 11/3/2011 3
C 11/3/2011 1

Following your edit you can use
declare #start date = '20111101'
declare #end date = '20111103';
WITH Dates(D)
AS (SELECT #start
UNION ALL
SELECT DATEADD(DAY, 1, D)
FROM Dates
WHERE DATEADD(DAY, 1, D) <= #end),
Destinations
As (SELECT DISTINCT Destination
FROM YourTable)
SELECT CA.Destination,
CA.[Number of posts],
dt.D AS ReportDate
FROM Destinations dst
CROSS JOIN Dates dt
CROSS APPLY (SELECT TOP 1 *
FROM YourTable y
WHERE dst.Destination = y.Destination
AND y.[Retrieval Date] <= DATEADD(DAY, 1, dt.D)
ORDER BY y.[Retrieval Date] DESC) CA
Ideally you would substitute another table listing the distinct destinations rather than using the CTE one built from the main table.