Linear Interpolation in MS SQL Server - sql-server

I work with annual mileage for each customer. Years run from 2009 to 2022 and are consecutive for every customer: some customers have records only from 2009 to 2020, for instance, but there are never gaps in the years. The issue is that Annual_Mlg is sometimes NULL. The NULLs appear at random: a customer can have 3 consecutive NULLs at the beginning of the range, 4 NULLs in the middle, or NULLs in the last 3 years. I need a mileage for EVERY SINGLE YEAR. A mileage of 0 is fine and should be used in the computation like any other valid value; it means the customer did not travel at all that year. By valid mileage I mean non-negative mileage, and the table I work with contains ONLY non-negative mileage.
I need to get rid of the NULLs completely by using the previous record, the next record, or interpolation. There should be NO default value. I hope there is better code to fix my problem. If there is an easier way to do this in R, I am all ears, but I have not attempted it in R since I am weaker in R than in MS SQL.
I created Prev_Mlg and Next_Mlg in order to interpolate when a NULL value is sandwiched between 2 non-NULL values. That seems to be working. Annual_Mlg is the field that holds the original values. Final_Mlg is where I need all values to be non-NULL. I used COALESCE to get rid of the NULLs to some extent. I am relatively new to SQL and do not seem to have the coding skills to tackle this type of problem on my own. Here is the logic behind what I put inside COALESCE.
The 3rd value inside COALESCE handles the case of 2 consecutive NULLs somewhere in the middle of the 2009-2022 range: when the code reaches the first NULL it takes the previous mileage (Prev_Mlg), and for the 2nd NULL it takes the mileage it just filled in and averages it with the next non-NULL value.
The many lead() and lag() calls are there to cover cases where a run of NULL records sits at either the beginning or the end of the range. This code obviously does not solve all the problems and looks like a joke.
I am looking for more meaningful code. NOTE: there should not be a default value; all NULLs should be replaced, and the logic is as follows:
If a NULL is sandwiched between 2 non-NULLs, the code should interpolate.
If there is a NULL at the beginning of the range (the range being the years 2009-2022), it should be replaced with the FIRST non-NULL value in the range.
If there is a NULL at the end of the range, it should be replaced by the FIRST non-NULL value counting from the end of the range (that is, the last non-NULL value).
Once a NULL value is replaced, the filled-in value should be used in later interpolation. Below are examples of the result I need.
Any help is GREATLY appreciated!!! THANK YOU SO MUCH!!! I AM REALLY STRUGGLING!!!
SELECT
Customer,
Year,
Final_Mlg = COALESCE(Annual_Mlg, (Prev_Mlg + Next_Mlg)/2,
(lag(Final_Mlg) over (partition by Customer order by Year) + Next_Mlg)/2, Prev_Mlg, Next_Mlg,
lead(Final_Mlg) over (partition by Customer order by Year),
lead(Final_Mlg,2) over (partition by Customer order by Year),
lead(Final_Mlg,3) over (partition by Customer order by Year),
lead(Final_Mlg,4) over (partition by Customer order by Year),
lead(Final_Mlg,5) over (partition by Customer order by Year),
lead(Final_Mlg,6) over (partition by Customer order by Year),
lead(Final_Mlg,7) over (partition by Customer order by Year),
lead(Final_Mlg,8) over (partition by Customer order by Year),
lead(Final_Mlg,9) over (partition by Customer order by Year),
lead(Final_Mlg,10) over (partition by Customer order by Year),
lead(Final_Mlg,11) over (partition by Customer order by Year),
lag(Final_Mlg) over (partition by Customer order by Year),
lag(Final_Mlg,2) over (partition by Customer order by Year),
lag(Final_Mlg,3) over (partition by Customer order by Year),
lag(Final_Mlg,4) over (partition by Customer order by Year),
lag(Final_Mlg,5) over (partition by Customer order by Year),
lag(Final_Mlg,6) over (partition by Customer order by Year),
lag(Final_Mlg,7) over (partition by Customer order by Year),
lag(Final_Mlg,8) over (partition by Customer order by Year),
lag(Final_Mlg,9) over (partition by Customer order by Year),
lag(Final_Mlg,10) over (partition by Customer order by Year),
lag(Final_Mlg,11) over (partition by Customer order by Year)),
Annual_Mlg,
Prev_Mlg,
Next_Mlg
FROM #table2
ORDER BY
Customer,
Year
CASE 1: couple of values are consecutive NULLs at the BEGINNING of range. There is a desired result below.
Year | Customer | Annual_Mileage
2009 | A | NULL (Should be Replaced by 3)
2010 | A | NULL (Should be Replaced by 3)
2011 | A | NULL (Should be Replaced by 3)
2012 | A | 3
2013 | A | 4
2014 | A | 5
2015 | A | 6
2016 | A | 7
2017 | A | 8
2018 | A | 9
2019 | A | 10
2020 | A | 11
2021 | A | 12
2022 | A | 13
CASE 2: couple of values are consecutive NULLs at the END of range.
Year | Customer | Annual_Mileage
2009 | A | 3
2010 | A | 3
2011 | A | 3
2012 | A | 3
2013 | A | 4
2014 | A | 5
2015 | A | 6
2016 | A | 7
2017 | A | 8
2018 | A | 9
2019 | A | 10
2020 | A | NULL (Should be Replaced by 10)
2021 | A | NULL (Should be Replaced by 10)
2022 | A | NULL (Should be Replaced by 10)
CASE 3: There are some NULLs in between non-NULL values.
Year | Customer | Annual_Mileage
2009 | A | 1
2010 | A | NULL (Should be replaced by 1.5)
2011 | A | 2
2012 | A | 3
2013 | A | 4
2014 | A | NULL (Should be replaced by 4.5)
2015 | A | 5
2016 | A | NULL (Should be replaced by 5.5)
2017 | A | 6
2018 | A | NULL (Should be replaced by 6)
2019 | A | NULL (Should be replaced by 6.5)
2020 | A | 7
2021 | A | 8
2022 | A | 8
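The fill rules listed in the question can be sketched outside SQL first. The following is a minimal Python sketch, not T-SQL and not the asker's method; the function name fill_mileage is hypothetical, and it assumes one customer's values are passed in year order with None standing in for NULL. It uses strict linear interpolation, so for the two-NULL gap in CASE 3 (between 6 and 7) it would give roughly 6.33 and 6.67 rather than the 6 and 6.5 shown in that table.

```python
def fill_mileage(values):
    """Return a copy of values with every None filled in.

    Leading Nones take the first known value, trailing Nones take the
    last known value, and interior gaps are linearly interpolated.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        return list(values)  # nothing to interpolate from
    filled = list(values)
    first, last = known[0], known[-1]
    # Leading gap: copy the first known value backwards.
    for i in range(first):
        filled[i] = values[first]
    # Trailing gap: copy the last known value forwards.
    for i in range(last + 1, len(values)):
        filled[i] = values[last]
    # Interior gaps: straight line between the two surrounding known values.
    for lo, hi in zip(known, known[1:]):
        step = (values[hi] - values[lo]) / (hi - lo)
        for i in range(lo + 1, hi):
            filled[i] = values[lo] + step * (i - lo)
    return filled


print(fill_mileage([None, None, None, 3, 4]))  # [3, 3, 3, 3, 4]
print(fill_mileage([1, None, 2, None, None]))  # [1, 1.5, 2, 2, 2]
```

If the "previous value, then average" behaviour from CASE 3 is what is really wanted, the interior loop is the only part that would need to change.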

Related

How to find difference between dates and find first purchase in an eCommerce database

I am using Microsoft SQL Server Management Studio. I am trying to measure the customer retention rate of an eCommerce site.
For this, I need four values:
customer_id
order_purchase_timestamp
age_by_month
first_purchase
The values of age_by_month and first_purchase are not in my database. I want to calculate them.
In my database, I have customer_id and order_purchase_timestamp.
The first_purchase should be the earliest instance of order_purchase_timestamp. I only want the month and year.
The age_by_month should be the difference of months from first_purchase to order_purchase_timestamp.
I only want to measure the retention of the customer for each month, so if two purchases are made in the same month, the second one shouldn't be shown.
The dates are between 2016-10-01 and 2018-09-30. The result should be ordered by order_purchase_timestamp.
An example
customer_id | order_purchase_timestamp
1 | 2016-09-04
2 | 2016-09-05
3 | 2016-09-05
3 | 2016-09-15
1 | 2016-10-04
to
customer_id | first_purchase | age_by_month | order_purchase_timestamp
1 | 2016-09 | 0 | 2016-09-04
2 | 2016-09 | 0 | 2016-09-05
3 | 2016-09 | 0 | 2016-09-05
1 | 2016-09 | 1 | 2016-10-04
What I have done
SELECT
customer_id, order_purchase_timestamp
FROM
orders
WHERE
(order_purchase_timestamp BETWEEN '2016-10-01' AND '2016-12-31')
OR (order_purchase_timestamp BETWEEN '2017-01-01' AND '2017-03-31')
OR (order_purchase_timestamp BETWEEN '2017-04-01' AND '2017-06-30')
OR (order_purchase_timestamp BETWEEN '2017-07-01' AND '2017-09-30')
OR (order_purchase_timestamp BETWEEN '2017-10-01' AND '2017-12-31')
OR (order_purchase_timestamp BETWEEN '2018-01-01' AND '2018-03-31')
OR (order_purchase_timestamp BETWEEN '2018-04-01' AND '2018-06-30')
OR (order_purchase_timestamp BETWEEN '2018-07-01' AND '2018-09-30')
ORDER BY
order_purchase_timestamp
Originally I was going to do it by quarters but I want to do it in months now.
The following approach is designed to be relatively easy to understand. There are other ways (e.g., windowed functions) that may be marginally more efficient; but this makes it easy to maintain at your current SQL skill level.
Note that the SQL commands below build on one another (so the answer is at the end). To follow along, here is a db<>fiddle with the working.
It's based around a simple query (which we'll use as a sub-query) that finds the first order_purchase_timestamp for each customer.
SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
The next thing is DATEDIFF to find the difference between 2 dates.
Then, you can use the above as a subquery to get the first date onto each row - then find the date difference e.g.,
SELECT orders.customer_id,
orders.order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, orders.order_purchase_timestamp) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
Note - DATEDIFF has a 'gotcha' that catches most people but works in your favor here: when comparing months, it ignores the day component. For example, there is 0 difference in months between 1 Jan and 31 Jan. On the other hand, there is a difference of 1 month between 31 Jan and 1 Feb. However, I think this is actually what you want!
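To make the month-boundary behaviour concrete, here is a small Python sketch (not T-SQL) of the arithmetic that DATEDIFF(month, ...) effectively performs. The helper name month_diff is made up for illustration; it only mirrors the "count month boundaries, ignore the day" rule described above.

```python
from datetime import date


def month_diff(start: date, end: date) -> int:
    # Count the month boundaries crossed; the day of the month is ignored.
    return (end.year - start.year) * 12 + (end.month - start.month)


print(month_diff(date(2016, 1, 1), date(2016, 1, 31)))  # 0
print(month_diff(date(2016, 1, 31), date(2016, 2, 1)))  # 1
print(month_diff(date(2016, 9, 4), date(2016, 10, 4)))  # 1
```

The last call corresponds to customer 1 in the question's example: first purchase in September 2016, second in October 2016, so age_by_month is 1.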
The above, however, repeats when a customer has multiple purchases within the month (it has one row per purchase). Instead, we can GROUP BY to group by the month it's in, then only take the first purchase for that month.
A 'direct' approach to this would be to group on YEAR(orders.order_purchase_timestamp) AND MONTH(orders.order_purchase_timestamp). However, I use a little trick below - using EOMONTH which finds the last day of the month. EOMONTH returns the same date for any date in that month; therefore, we can group by that.
Finally, you can add the WHERE expression and ORDER BY to get the results you asked for (between the two dates)
SELECT orders.customer_id,
MIN(orders.order_purchase_timestamp) AS order_purchase_timestamp,
first_purchases.first_purchase_date,
DATEDIFF(month, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)) AS age_by_month
FROM orders
INNER JOIN
(SELECT customer_id, MIN(order_purchase_timestamp) AS first_purchase_date
FROM orders AS orders_ref
GROUP BY customer_id
) AS first_purchases ON orders.customer_id = first_purchases.customer_id
WHERE orders.order_purchase_timestamp BETWEEN '20161001' AND '20180930'
GROUP BY orders.customer_id, first_purchases.first_purchase_date, EOMONTH(orders.order_purchase_timestamp)
ORDER BY order_purchase_timestamp;
Results - note they are different from yours because you wanted the earliest date to be 1/10/2016.
customer_id order_purchase_timestamp first_purchase_date age_by_month
1 2016-10-04 00:00:00.000 2016-09-04 00:00:00.000 1
Edit: Because someone else will do it like this otherwise!
You can do this with a single read-through that will potentially run a little faster. It is also a bit shorter - but harder to understand imo.
The query below uses window functions to calculate both the customer's earliest purchase and the earliest purchase for each month (and uses DISTINCT rather than a GROUP BY). With that, it just does the DATEDIFF to calculate the difference.
WITH monthly_orders AS
(SELECT DISTINCT orders.customer_id,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id, EOMONTH(orders.order_purchase_timestamp)) AS order_purchase_timestamp,
MIN(orders.order_purchase_timestamp) OVER (PARTITION BY orders.customer_id) AS first_purchase_date
FROM orders)
SELECT *, DATEDIFF(month, first_purchase_date, order_purchase_timestamp) AS age_by_month
FROM monthly_orders
WHERE order_purchase_timestamp BETWEEN '20161001' AND '20180930';
Note however that this has one difference in the results. If you have 2 orders in a month, and your lowest date filter falls between the two (e.g., orders on 15/10 and 20/10, and your minimum date is 16/10), then the row won't be included, as the earliest purchase in the month is outside the filter range.
Also beware with both of these and what type of date or datetime field you are using - if you have datetimes rather than just dates, BETWEEN '20161001' AND '20180930' is not the same as >= '20161001' AND < '20181001'
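The difference between the two range styles is easy to see with a timestamp on the last day. This is a Python sketch of the comparison logic only, assuming the string '20180930' is treated as midnight at the start of that day when compared against a datetime; the afternoon order time is a hypothetical value.

```python
from datetime import datetime

lower = datetime(2016, 10, 1)
upper_inclusive = datetime(2018, 9, 30)  # '20180930' as a datetime: midnight
upper_exclusive = datetime(2018, 10, 1)  # first instant after the range

# Hypothetical order placed in the afternoon of the last day of the range.
order_time = datetime(2018, 9, 30, 14, 30)

in_between = lower <= order_time <= upper_inclusive  # BETWEEN ... AND ...
in_half_open = lower <= order_time < upper_exclusive  # >= ... AND < ...

print(in_between, in_half_open)  # False True
```

With a plain date column both forms select the same rows; the half-open form only matters once a time component is present.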
Here is a short query that achieves all you want (descriptions of the methods used are inline):
create table #test (
customer_id int,
order_purchase_timestamp date
)
-- some test data
insert into #test values
(1, '2016-09-04'),
(2, '2016-09-05'),
(3, '2016-09-05'),
(3, '2016-09-15'),
(1, '2016-10-04');
select
customer_id,
-- takes care of correct display of first_purchase
format(first_purchase, 'yyyy-MM') first_purchase,
-- used to get the difference in months
datediff(m, first_purchase, order_purchase_timestamp) age_by_month,
order_purchase_timestamp
from (
select
*,
-- window function used to find min value for given column within group
-- for each row
min(order_purchase_timestamp) over (partition by customer_id) first_purchase
from #test
) a

Create function to filter out data that meets a specific condition in SQL Server 2012

I'm creating a report in SQL Server 2012. The report pulls discharged patients' information from the database. It only needs to show the information for each patient's third day of admission.
For example, patient A was admitted into the hospital on January 01/2016, (stayed for 4 days) January 02, January 03, January 04 and was discharged on January 05/2016.
Patient B was admitted into the hospital on January 15/2016, (stayed for 6 days) January 16, January 17, January 18, January 19, January 20 and was discharged on January 21/2016.
The report only needs to show the third day's (January 03 and January 17) information.
How do I write a function that filters the data so that only the necessary rows are shown? Any suggestions?
Greatly appreciate your big help!
Thanks!
Rose
You can use a combination of Row_Number() and Partition by. Basically you partition by the patient and you row_number based on the date ascending.
Select ... ROW_NUMBER() over(PARTITION BY patientId Order by TheDate)
Then you can use that above as a sub query and filter where t.theRow = 3.
Sorry I'm on an iPod so it's difficult to provide more info. I'll try to clean this up or format it tomorrow.
Edit
Now that I'm on a PC, here's what you can do per above
SELECT
t.PatientID,
t.TheNumber
FROM
(
Select
PatientID,
ROW_NUMBER() over(PARTITION BY PatientID ORDER BY YourDateField) AS TheNumber
FROM
YourTable
) t
WHERE
t.TheNumber = 3
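The same "number the rows per patient, keep row 3" idea can be sketched in Python to check the expected result against the question's example. This is only an illustration of the logic, not SQL; the function name nth_day_rows is hypothetical, and it assumes the table holds exactly one row per patient per day of the stay.

```python
from collections import defaultdict
from datetime import date, timedelta


def nth_day_rows(rows, n=3):
    """Map each patient to the date of their n-th row in date order."""
    by_patient = defaultdict(list)
    for patient_id, day in rows:
        by_patient[patient_id].append(day)
    result = {}
    for patient_id, days in by_patient.items():
        days.sort()  # same role as ORDER BY inside ROW_NUMBER()
        if len(days) >= n:  # patients with shorter stays are dropped
            result[patient_id] = days[n - 1]
    return result


# Patient A: 1-5 Jan 2016, patient B: 15-21 Jan 2016 (from the question).
rows = [("A", date(2016, 1, 1) + timedelta(days=i)) for i in range(5)]
rows += [("B", date(2016, 1, 15) + timedelta(days=i)) for i in range(7)]

print(nth_day_rows(rows))
```

For this data the function returns 3 January for patient A and 17 January for patient B, matching the days named in the question.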

sum up every 12 months in a table

I have a table for calculating installments, and I save all the related data in it. For example, if I calculate 60 installments and save all the data, that covers 60 months. Now I need to sum up the value of one column for every 12 months. Sometimes we start paying the installments from the middle of the year.
My DB looks like this. The highlighted column must be summed for every 12 months. The two images show a single table.
Suppose I have 30 installments starting in Jun 2012. If I started paying in Jun 2012, the installments should be summed from Jun 2012 to May 2013. We can't use GROUP BY year. I must sum up like this:
sum jun 2012 to may 2013
sum jun 2013 to may 2014
sum jun 2014 to nov 2014 ( only 6 months left)
You can use ROW_NUMBER to generate a group of 12 months:
WITH Cte AS(
SELECT *,
RN = (ROW_NUMBER() OVER(ORDER BY InstallmentMonth) - 1)/ 12
FROM your_table
)
SELECT
SUM(InteresetPerInstallment)
FROM Cte
GROUP BY RN
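The grouping trick is plain integer division on the row number. Here is a minimal Python sketch of that arithmetic, assuming the amounts are already sorted by installment month; the function name sum_by_12_months and the flat amount of 10 are made up for illustration.

```python
from itertools import groupby


def sum_by_12_months(amounts):
    """Sum consecutive blocks of 12 amounts; the last block may be shorter."""
    # Row numbers start at 1, so (row_number - 1) // 12 is 0 for rows 1-12,
    # 1 for rows 13-24, and so on.
    numbered = enumerate(amounts, start=1)
    return [
        sum(amount for _, amount in block)
        for _, block in groupby(numbered, key=lambda pair: (pair[0] - 1) // 12)
    ]


# 30 installments of 10 each: two full years and a final block of 6 months.
print(sum_by_12_months([10] * 30))  # [120, 120, 60]
```

The three sums correspond to the three periods in the question: Jun 2012 to May 2013, Jun 2013 to May 2014, and the remaining 6 months.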

Calculate payments on a month by month bases from date opened SQL 2008

I'm currently trying to replicate an old report that used to produce a rolling sum of collections. However it wasn't a standard month on month. Here is screen shot of the excel based report.
The blue section is based on a simple query and gives the dataset used to start(EXAMPLE):
-- YEAR is spelled out because the abbreviation Y means dayofyear in T-SQL
SELECT COUNT(AccountNo) AS Number, SUM(Balance) AS Value, DATENAME(MM,DateOpened) AS Month, DATEPART(YEAR,DateOpened) AS Year FROM tblAccounts
GROUP BY DATENAME(MM,DateOpened), DATEPART(YEAR,DateOpened)
The tables are very basic :
AccountNo | Balance | DateOpened
12345 | 1245.55 | 01/01/2015
I'm struggling to get it to work out the months on a rolling basis, so Month 1 for Apr 2011 will be the first month for those files (payments in April), month 2 would be payments in May for the accounts opened in April (I hope that is clear).
So this means Month 1 for April would be April, and Month 1 for Nov would be Nov. Payments are stored in a table tblPayments
AccountNo | DatePayment | PaymentValue
12345 | 02/02/2015 | 15.99
Please ask if I haven't been clear enough
Assuming you have a column called "DatePayment", you should simply do something like this:
-- YEAR is spelled out because the abbreviation Y means dayofyear in T-SQL
SELECT COUNT(AccountNo) AS Number, SUM(Balance) AS Value,
DATENAME(MM,DateOpened) AS Month, DATEPART(YEAR,DateOpened) AS Year,
DATEDIFF(MONTH, DateOpened, DatePayment) AS MonthN
FROM [...]
GROUP BY DATENAME(MM,DateOpened), DATEPART(YEAR,DateOpened),
DATEDIFF(MONTH, DateOpened, DatePayment)
The DATEDIFF simply counts the months between the date the account was opened and the date of the payment. Note that you might want to change the DateOpened to always be the 1st of the month in the DATEDIFF calculation.
In the FROM [...] part of your query, you will need a join between your Payments-table and the table holding your accounts, in order to be able to compare DateOpened with DatePayment. You should join them on the AccountNo-column. This looks something like this:
FROM Accounts INNER JOIN Payments ON Accounts.AccountNo = Payments.AccountNo
After doing this, you will need to make sure that all references to columns that exist in both tables are fully qualified. This means that COUNT(AccountNo) should be changed to COUNT(Accounts.AccountNo), etc.

OrderBy clause sorting issue?

I am trying to order by Date in SQL Server and I am facing a weird issue. There are 2 cases I should explain to help you understand it.
Case 1 :
select [MonthName]
from prod.[dim date]
where F_Year = 2014
order by [Date]
My output:
june june june july july july july august
// Here I get duplicates but order by is working as expected
Case 2 : tried to remove duplicates by using "distinct"
select distinct [MonthName]
from prod.[dim date]
where F_Year = 2014
order by [Date]
My output:
August july june
// Order By not working as expected (ordering alphabetical wise ) .
Any workaround is appreciated
Firstly your query seems to be working off a table with 3 fields representing possibly a single date or maybe 2 - hard to tell without the structure but I can see : F_Year, [Date] & [MonthName]. So if these actually represent a single date, then revert to that single date and use formatting to determine F_Year and MonthName.
Secondly check out these posts for a probable answer to your question:
Sort by Date in SQL
Convert Month Number to Month Name Function in SQL
