Snowflake: Dateadd only weekdays - snowflake-cloud-data-platform

Is it possible to add only weekdays in a date function?
dateadd(day, 10, business_date)
Instead of returning next 10 days, is it possible to retrieve next 10 weekdays?
Regards,
Sridar

There are some semi-complicated functions for this in other languages that could be converted, but without knowing more, I'd generally recommend creating a calendar table.
In that table, you can label dates as weekdays, then join to it and filter on that label.
This also lets you extend to holidays with an additional IsHoliday flag.
Then you can join to lists of dates with queries like this:
SELECT DateColumnValue, RANK() OVER(ORDER BY DATEKEY) RNK
FROM DIMDATE
WHERE DateColumnValue >= CURRENT_DATE
AND isWeekday = 1
AND isHoliday = 0
QUALIFY RNK <= 10
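For comparison, the "function in another language" route mentioned above can be sketched in a few lines of Python. This is only an illustration, not Snowflake code: the name add_weekdays and the optional holidays set are assumptions made for the example, and the start date itself is not counted.

```python
from datetime import date, timedelta


def add_weekdays(start: date, n: int, holidays: frozenset = frozenset()) -> date:
    """Return the date n business days after start (start itself is not counted)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    current = start
    remaining = n
    while remaining > 0:
        current += timedelta(days=1)
        # weekday(): Monday is 0 ... Sunday is 6, so 5 and 6 are the weekend
        if current.weekday() < 5 and current not in holidays:
            remaining -= 1
    return current


# 2024-01-01 is a Monday; ten weekdays later is Monday 2024-01-15
print(add_weekdays(date(2024, 1, 1), 10))
```

The loop walks one day at a time, which is the same thing the calendar table does declaratively: skip rows that are weekends or holidays and stop at the tenth remaining row.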

Related

How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month so if two purchases are made in the same month it shouldn't be shown.
The dates are between 2016-10-01 and 2018-09-30. The output should be ordered by order_purchase_timestamp.
An example
customer_id  order_purchase_timestamp
1            2016-09-04
2            2016-09-05
3            2016-09-05
3            2016-09-15
1            2016-10-04
to
customer_id  first_purchase  age_by_month  order_purchase_timestamp
1            2016-09         0             2016-09-04
2            2016-09         0             2016-09-05
3            2016-09         0             2016-09-05
1            2016-09         1             2016-10-04
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.

The following approach is designed to be relatively easy to understand. There are other ways (e.g., window functions) that may be marginally more efficient, but this one is easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that catches most people but works in your favour here: when comparing months, it ignores the day component and only counts month boundaries crossed. For example, there is a difference of 0 months between 1 Jan and 31 Jan, but a difference of 1 month between 31 Jan and 1 Feb. However, I think this is actually what you want!
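To make the boundary-counting behaviour concrete, here is a small Python sketch that mimics DATEDIFF(month, ...) as described above. The function name month_boundaries_crossed is made up for the illustration; it is not part of SQL Server.

```python
from datetime import date


def month_boundaries_crossed(start: date, end: date) -> int:
    """Count month boundaries between two dates, ignoring the day of the month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


print(month_boundaries_crossed(date(2016, 1, 1), date(2016, 1, 31)))   # same month
print(month_boundaries_crossed(date(2016, 1, 31), date(2016, 2, 1)))   # one boundary
print(month_boundaries_crossed(date(2016, 9, 4), date(2016, 10, 4)))   # one boundary
```

Because only the year and month take part in the arithmetic, a purchase on the last day of one month and another on the first day of the next are one month apart, which is the behaviour a monthly retention measure needs.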
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The query below uses window functions to calculate both the customer's earliest purchase and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note, however, that this has one difference in the results. If you have 2 orders in a month, and your lowest date filter falls between the two (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10), then the row won't be included, because the earliest purchase in the month is outside the filter range.
Also beware with both of these and what type of date or datetime field you are using - if you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001'
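The datetime caveat can be shown with plain comparisons. The Python sketch below is an analogy only: the sample order time is invented, and it relies on a date literal such as '20180930' being treated as midnight at the start of that day when compared against a datetime.

```python
from datetime import datetime

# A hypothetical order placed on the afternoon of the last day in the range
late_order = datetime(2018, 9, 30, 14, 30)

lower = datetime(2016, 10, 1)
upper_inclusive = datetime(2018, 9, 30)   # midnight at the START of 30 September
upper_exclusive = datetime(2018, 10, 1)   # first instant after the range

# BETWEEN lower AND upper_inclusive: closed on both ends
in_between = lower <= late_order <= upper_inclusive

# >= lower AND < upper_exclusive: half-open range
in_half_open = lower <= late_order < upper_exclusive

print(in_between, in_half_open)
```

The closed range drops the afternoon order because it falls after midnight on 30 September, while the half-open range keeps every moment of that day.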

Here is short query that achieves all you want (descriptions of methods used are inline):
-- table variables are declared with @ (a # prefix is for temp tables made with CREATE TABLE)
declare @test table (
customer_id int,
order_purchase_timestamp date
)
-- some test data
insert into @test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
customer_id,
-- takes care of correct display of first_purchase
format(first_purchase, 'yyyy-MM') first_purchase,
-- used to get the difference in months
datediff(month, first_purchase, order_purchase_timestamp) age_by_month,
order_purchase_timestamp
from (
select
*,
-- window function used to find min value for given column within group
-- for each row
min(order_purchase_timestamp) over (partition by customer_id) first_purchase
from @test
) a
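As a cross-check on the expected output in the question, the same logic can be sketched outside the database in Python. This is an illustration under assumptions, not a replacement for the SQL: the function name retention_rows is hypothetical, and unlike the query just above it also keeps only the earliest purchase per customer per month, as the question asks.

```python
from datetime import date

# Sample data from the question
orders = [
    (1, date(2016, 9, 4)),
    (2, date(2016, 9, 5)),
    (3, date(2016, 9, 5)),
    (3, date(2016, 9, 15)),
    (1, date(2016, 10, 4)),
]


def retention_rows(orders):
    """Return (customer_id, first_purchase 'YYYY-MM', age_by_month, purchase date) rows."""
    # Earliest purchase per customer
    first_purchase = {}
    for customer_id, purchased in orders:
        if customer_id not in first_purchase or purchased < first_purchase[customer_id]:
            first_purchase[customer_id] = purchased

    # Earliest purchase per customer per calendar month (drops same-month repeats)
    earliest_in_month = {}
    for customer_id, purchased in orders:
        key = (customer_id, purchased.year, purchased.month)
        if key not in earliest_in_month or purchased < earliest_in_month[key]:
            earliest_in_month[key] = purchased

    rows = []
    for (customer_id, _, _), purchased in earliest_in_month.items():
        first = first_purchase[customer_id]
        # Month boundaries crossed, ignoring the day of the month
        age = (purchased.year - first.year) * 12 + (purchased.month - first.month)
        rows.append((customer_id, first.strftime("%Y-%m"), age, purchased))

    # Order by purchase date, then customer, for a stable result
    rows.sort(key=lambda r: (r[3], r[0]))
    return rows


for row in retention_rows(orders):
    print(row)
```

Customer 3's second purchase on 2016-09-15 is dropped because it falls in the same month as the first one, which matches the example output in the question.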

How do I dynamically generate dates between two dates in Snowflake?

I've been searching for a good generate_series analog in Snowflake but what I've found so far is a bit limiting in scope. Most of the examples I've seen use rowcount but I need something more dynamic than that.
I have these columns:
location_id, subscription_id, start_date, end_date
The datediff of the date columns is usually a year but there are many instances where it isn't so I need to account for that.
How do I generate a gapless date range between my start and end dates?
Thank you!

There are several ways to approach this, but here's the way I do it with SQL Generator function Datespine_Groups.
The reason I like to do it this way is that it's flexible enough that I can add weekly, hourly, or monthly intervals between the dates and reuse the code.
The parameter group bounds changes the way the join happens in a subtle way that allows you to control how the dates get filtered out:
global - every location_id, subscription_id combination will start on the same start_date
local - every location_id, subscription_id has their own start/end dates based on the first and last values in the date column
mixed - every location_id, subscription_id has their own start/end dates, but they all share the same end date
Rather than try and make it perfect in 1 query, I think it's probably easier to generate it with mixed and then filter out where the group_start_date occurs after the end_date of your original data.
Here's the SQL. At the very beginning you can either (1) find a way to dynamically generate the 3 parameters, or (2) hard code a ridiculous range that'll last your career and let the rest of the query filter them out :)
You can change month to another datepart, I only assumed you were looking for monthly.
WITH GLOBAL_SPINE AS (
SELECT
ROW_NUMBER() OVER (
ORDER BY
NULL
) as INTERVAL_ID,
DATEADD(
'month',
(INTERVAL_ID - 1),
'2018-01-01T00:00' :: timestamp_ntz
) as SPINE_START,
DATEADD(
'month', INTERVAL_ID, '2018-01-01T00:00' :: timestamp_ntz
) as SPINE_END
FROM
TABLE (
GENERATOR(ROWCOUNT => 2192)
)
),
GROUPS AS (
SELECT
location_id,
subscription_id,
MIN(start_date) AS LOCAL_START,
MAX(start_date) AS LOCAL_END
FROM
My_First_Table
GROUP BY
location_id,
subscription_id
),
GROUP_SPINE AS (
SELECT
location_id,
subscription_id,
SPINE_START AS GROUP_START,
SPINE_END AS GROUP_END
FROM
GROUPS G CROSS
JOIN LATERAL (
SELECT
SPINE_START,
SPINE_END
FROM
GLOBAL_SPINE S
WHERE
S.SPINE_START >= G.LOCAL_START
)
)
SELECT
G.location_id AS GROUP_BY_location_id,
G.subscription_id AS GROUP_BY_subscription_id,
GROUP_START,
GROUP_END,
T.*
FROM
GROUP_SPINE G
LEFT JOIN My_First_Table T ON start_date >= G.GROUP_START
AND start_date < G.GROUP_END
AND G.location_id = T.location_id
AND G.subscription_id = T.subscription_id
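For readers who want to see the target shape of the result before adapting the SQL, here is a minimal Python sketch of a gapless daily range per row. It is not the Datespine_Groups generator: the names date_spine and expand_subscriptions and the sample rows are assumptions made for the illustration, and both endpoints are treated as inclusive.

```python
from datetime import date, timedelta


def date_spine(start_date: date, end_date: date) -> list:
    """Return every date from start_date to end_date inclusive (empty if reversed)."""
    if end_date < start_date:
        return []
    days = (end_date - start_date).days
    return [start_date + timedelta(days=i) for i in range(days + 1)]


def expand_subscriptions(subscriptions):
    """Yield one (location_id, subscription_id, day) row per day of each subscription."""
    for location_id, subscription_id, start_date, end_date in subscriptions:
        for day in date_spine(start_date, end_date):
            yield (location_id, subscription_id, day)


# Hypothetical rows: (location_id, subscription_id, start_date, end_date)
sample = [
    (10, 100, date(2020, 2, 27), date(2020, 3, 1)),
    (10, 101, date(2021, 1, 1), date(2021, 1, 2)),
]

for row in expand_subscriptions(sample):
    print(row)
```

The length of each range comes from the row's own dates rather than a fixed row count, which is the "dynamic" behaviour the question asks for.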

How to do WHERE <before> an aggregate function (Postgres)

It's hard to explain from the title, but this is my SQL:
SELECT
SUM("payments"."amount"),
"invoices"."property_id"
FROM "payments"
JOIN "invoices"
ON "payments"."invoice_id" = "invoices"."id"
GROUP BY "property_id"
It returns the sum of all Payment records (amount column) for a particular Property (which is connected through its invoices).
In other words:
Property has_many: :invoices
Invoice has_one: :payment
I'm trying to select payments between a particular date range though, but it has to happen "before" the aggregate function (so do the exact query above, but only for 2017-01-01 through 2017-02-01). The field would be generated_at on Payment

You are looking for a WHERE clause. (WHERE is executed before aggregation; HAVING is executed after.) Suggested date literals in PostgreSQL are ANSI standard DATE 'YYYY-MM-DD'. Date ranges are usually checked with >= start day and < end day + 1 (in order to deal properly with the time part if any).
SELECT
SUM(p.amount),
i.property_id
FROM payments p
JOIN invoices i ON p.invoice_id = i.id
WHERE p.generated_at >= DATE '2017-01-01'
AND p.generated_at < DATE '2017-02-02'
GROUP BY i.property_id;
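The effect of filtering before aggregation can be tried out with a throwaway database. The Python sketch below uses the standard-library sqlite3 module as a stand-in for PostgreSQL, with invented sample rows and timestamps stored as ISO-8601 text, so treat it as an assumption-laden illustration rather than the production query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE invoices (id INTEGER PRIMARY KEY, property_id INTEGER);
    CREATE TABLE payments (invoice_id INTEGER, amount REAL, generated_at TEXT);
    INSERT INTO invoices VALUES (1, 7), (2, 7), (3, 8);
    INSERT INTO payments VALUES
        (1, 100.0, '2017-01-15 09:00:00'),
        (2,  50.0, '2017-02-01 18:30:00'),  -- late on the last day of the range
        (3,  25.0, '2017-03-01 00:00:00');  -- outside the range
    """
)

# WHERE removes rows before GROUP BY and SUM see them.
# The range is half-open: >= first day and < the day after the last day.
rows = conn.execute(
    """
    SELECT i.property_id, SUM(p.amount)
    FROM payments p
    JOIN invoices i ON p.invoice_id = i.id
    WHERE p.generated_at >= ? AND p.generated_at < ?
    GROUP BY i.property_id
    ORDER BY i.property_id
    """,
    ("2017-01-01", "2017-02-02"),
).fetchall()

print(rows)
conn.close()
```

Property 8 does not appear at all because its only payment was filtered out before grouping, and the evening payment on 2017-02-01 is still counted thanks to the exclusive upper bound.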

SQL Server: selecting a year of account based on a specific date and a date range

I need to apportion some values to a financial year that begins on the 1st December and ends on the 30th November each year.
The rows that contain the value fields are in a table (TABLE A) that has a reference number and an incident date
Table A
ReferenceNumber, Value, IncidentDate
1, 10.00, 01/12/14
2, 15.00, 10/05/13
3, 20.00, 14/10/13
TABLE A is then joined to TABLE B, which also has the reference number and contains transactional data including a start date field. Each reference number may have several transactions with different start date values, and the aim is to ensure the row selected from TABLE B is the one with the most recent start date before the incident date from TABLE A.
TABLE B
ReferenceNumber, StartDate
1, 01/05/14
1, 01/05/15
2, 12/04/14
2, 12/04/15
3, 05/06/14
3, 04/06/15
TABLE C is a time table that apportions specific dates to financial years.
TABLE C
Date, FinancialYear
30/11/14, FY2013/14
01/12/14, FY2014/15
I am trying to construct a query which joins table A to table B on the Reference number and incident date to start date as described above and then adds the FinancialYear value based on the start date from Table B.
I am struggling to get this to return the correct financial year.
In addition, the data quality is poor so there are many examples where the Incident date from table A is greater than the scope of the financial year selected based on the start date from table B.
I need to be able to return either the appropriate financial year based on start date or, failing that, the financial year corresponding to the incident date
Here is the code I currently have:
SELECT a.ReferenceNumber,
b.StartDate,
c.FinancialYear
FROM dbo.TableA a
INNER JOIN dbo.TableB b
ON a.ReferenceNumber = b.ReferenceNumber
AND b.StartDate = (SELECT MIN(StartDate) FROM dbo.TableB WHERE a.IncidentDateTime > StartDate AND ReferenceNumber = a.ReferenceNumber)
INNER JOIN dbo.Calendar c
ON rdc.PolicyStartDate = c.[Date]

select
a.ReferenceNumber,
min(Value) as Value,
min(IncidentDate) as IncidentDate,
max(StartDate) as StartDate, /* others are dummy aggregates but this one is not */
'FY'
+ cast(year(dateadd(month, -11, min(IncidentDate))) as char(4))
+ '/'
+ cast(year(dateadd(month, -11, min(IncidentDate))) - 1999 as char(2)) as FY
from
TableA a cross apply
(
select * from TableB b
where b.ReferenceNumber = a.ReferenceNumber and b.StartDate < a.IncidentDate
) b
group by a.ReferenceNumber
Your fiscal year starts eleven months "late" so it's easy to determine where a date falls without a lookup.
year(dateadd(month, -11, <date>))
Getting it to match your "FY2013/14" format takes a little extra work but you could write little functions to do these kinds of calculations. By the way, the 1999 comes from adding 1 and subtracting 2000 to get a two-digit year value. Could use modulo 100 to make it generic beyond the year 2098 if that's important.
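The "little function" idea can be sketched in Python to check the arithmetic. The name financial_year_label is hypothetical. Shifting a date back eleven months keeps the year for December and moves every other month into the previous year, and the sketch uses modulo 100 for the two-digit part as suggested above.

```python
from datetime import date


def financial_year_label(d: date) -> str:
    """Label a date with its financial year, where the year runs 1 Dec to 30 Nov."""
    # Equivalent to year(dateadd(month, -11, d)): only December stays in its own year
    start_year = d.year if d.month == 12 else d.year - 1
    # Two-digit end year; modulo 100 keeps this correct across century boundaries
    end_two_digits = (start_year + 1) % 100
    return f"FY{start_year}/{end_two_digits:02d}"


print(financial_year_label(date(2014, 11, 30)))
print(financial_year_label(date(2014, 12, 1)))
```

The two sample dates are the ones from Table C in the question, so the printed labels can be compared directly against the lookup table.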

My assumptions going in:
IncidentDate and StartDate are datatype "DATE". This should also work if they are DATETIME with all time values set the same.
TableC contains a row for every possible date (which is what you implied). Another style would be {FinancialYear, FirstDate, LastDate}, and you'd join to this table using between in the on clause.
I didn't quite get what you meant regarding "the data quality is poor". This query will pull back the desired IncidentDate and StartDate (if available), allowing you to apply business logic to them. My sample here is "if there is no applicable StartDate, base the FinancialYear on IncidentDate". (Replace those outer joins with inner joins if the data permits it.)
Toss in parameters if you don't want this data for all ReferenceNumbers.
Check for syntax errors, I couldn't run and test this query.
(Note that "Date" is a confusing name for a column.)
WITH ctePart1 (ReferenceNumber, IncidentDate, ClosestStartDate)
as (-- Data based on the join to "most recent prior StartDate"
select
ta.ReferenceNumber
,ta.IncidentDate
,max(tb.StartDate)
from TableA ta
left outer join TableB tb
on tb.ReferenceNumber = ta.ReferenceNumber
and tb.StartDate < ta.IncidentDate
group by
ta.ReferenceNumber
,ta.IncidentDate)
select
cte.ReferenceNumber
,cte.IncidentDate
,cte.ClosestStartDate
,isnull(tcStart.FinancialYear, tcIncident.FinancialYear) FinancialYear
from ctePart1 cte
left outer join TableC tcStart
on tcStart.Date = cte.ClosestStartDate
left outer join TableC tcIncident
on tcIncident.Date = cte.IncidentDate

Factoring public holidays in to a SQL code

Apologies if this is a simple one. I'm looking for some help with the following:
SELECT *
FROM (
SELECT TOP 7
RIGHT (CONVERT (VARCHAR, CompletedDate, 108), 8) AS Time,
WorkType
FROM Table
WHERE WorkType = 'WorkType1'
OR DATEPART (DW, CompletedDate) IN ('7','1')
AND WorkType = 'WorkType2'
ORDER BY CompletedDate DESC) Table
ORDER BY CompletedDate ASC
Multiple events run every day, and the above searches for the last one scheduled to run each day, and pulls the time from it for the past 7 days. This time marks the end of the day's events, and is the value I'm after.
Events run at a different order on weekends, so I search for a different WorkType. WorkType 1 is unique to weekdays. WorkType2 is run both at weekdays and weekends, however it is not the final event on a weekday, so I don't search for it then.
However, this kind of falls apart when public holidays such as bank holidays come into play, as they use the weekend timings. I still need to capture these times, but the above skips over them. If I were to remove or expand the DATEPART, I would end up with duplicate values for each day that don't mark the final job of the day.
What changes can I make to this to capture these lost holiday timings, without manually going through and checking every holiday? Is there a way that I can return a value for WorkType2 if WorkType1 does not appear on a day?

I suggest a materialized calendar table with one row per date along with the desired WorkType for that day. That will allow you to simply join on to the calendar table to determine the proper WorkType value without embedding the logic in the query itself.
With this table loaded with all dates for your reporting domain:
CREATE TABLE dbo.WorkTypeCalendar(
CalendarDate date NOT NULL
CONSTRAINT PK_Calendar PRIMARY KEY CLUSTERED
, WorkType varchar(10) NOT NULL
);
GO
The query can be refactored as below:
SELECT Time
, WorkType
FROM ( SELECT TOP 7
RIGHT(CONVERT (varchar, t.CompletedDate, 108), 8) AS Time
, t.WorkType -- qualified: both t and c have a WorkType column
, t.CompletedDate -- exposed so the outer ORDER BY can reference it
FROM Table1 AS t
JOIN WorkTypeCalendar AS c ON t.WorkType = c.WorkType
AND t.CompletedDate >= c.CalendarDate
AND t.CompletedDate < DATEADD(DAY,
1,
c.CalendarDate)
ORDER BY t.CompletedDate DESC
) Table1
ORDER BY CompletedDate ASC
You also might consider making this a generalized utility calendar table. See http://www.dbdelta.com/calendar-table-and-datetime-functions/ for a complete example of such a table and a script to load US holidays that you can adjust for your needs and locale.
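One way to produce the rows for such a calendar is sketched below in Python. It is a hypothetical loader, not part of the linked script: the function name build_work_type_calendar and the holiday set are assumptions, and it follows the question's rule that weekends and public holidays use the WorkType2 timings while ordinary weekdays use WorkType1.

```python
from datetime import date, timedelta


def build_work_type_calendar(first_day: date, last_day: date, holidays: set) -> dict:
    """Map each date in the range to the WorkType that marks the end of that day."""
    calendar = {}
    day = first_day
    while day <= last_day:
        is_weekend = day.weekday() >= 5  # Saturday is 5, Sunday is 6
        # Weekends and public holidays follow the weekend schedule
        calendar[day] = "WorkType2" if is_weekend or day in holidays else "WorkType1"
        day += timedelta(days=1)
    return calendar


# Hypothetical holiday list; 2023-12-25 falls on a Monday
calendar = build_work_type_calendar(
    date(2023, 12, 22), date(2023, 12, 26), {date(2023, 12, 25)}
)
for day, work_type in calendar.items():
    print(day, work_type)
```

Each (date, WorkType) pair corresponds to one row of dbo.WorkTypeCalendar, so the holiday handling lives in the data rather than in the reporting query.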