Calculating Percentile on large SQL tables - sql-server

I have a large SQL table with 100+ million records in SQL Server as follows:
+------------+------------+----------+-----------+-----------+
| CustomerID | TransDate  | Category | Num_Trans | Sum_Trans |
+------------+------------+----------+-----------+-----------+
| 457136432  | 2022-12-31 | TAXI     |        18 |    220.34 |
| 863326783  | 2022-12-31 | FOOD     |        76 |    980.71 |
+------------+------------+----------+-----------+-----------+
This table contains the number of transactions and the sum of transaction values for each category in a 6-month window prior to TransDate, per customer.
I need to run some percentile calculations for a reporting task (query below):
SELECT DISTINCT TransDate, Category
    , PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category) AS P_LOWER_NUM
    , PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category) AS P_MEDIAN_NUM
    , PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category) AS P_UPPER_NUM
    , PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category) AS P_LOWER_SUM
    , PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category) AS P_MEDIAN_SUM
    , PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category) AS P_UPPER_SUM
FROM large_table
with the following non-clustered index on the table:
CREATE NONCLUSTERED INDEX MyIndex
ON large_table (TransDate, Category)
INCLUDE (Num_Trans, Sum_Trans);
Unfortunately, this calculation takes over 30 minutes, which is not acceptable for my application.
To boost performance, I even partitioned the table on TransDate and created the index per partition, but table partitioning did not change the execution time significantly either.
Is there any way to speed up these calculations?
PS: Looking at the execution plan, it seems the problem is that each PERCENTILE_CONT calculation forces the query to sort the records again. Is it possible to somehow sort the Num_Trans and Sum_Trans columns once and then do all the percentile calculations?
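One pattern that avoids a separate sort per PERCENTILE_CONT is to rank each measure once and derive all of its percentiles from that single ordering. This is only a sketch: it computes nearest-rank percentiles rather than the linear interpolation PERCENTILE_CONT performs, so results can differ slightly at the tails, and Sum_Trans would need a second, analogous pass:

```sql
-- Sketch: one sort serves all three Num_Trans percentiles.
-- Nearest-rank approximation; PERCENTILE_CONT interpolates, so values may differ slightly.
WITH ranked AS (
    SELECT TransDate, Category, Num_Trans,
           ROW_NUMBER() OVER (PARTITION BY TransDate, Category ORDER BY Num_Trans) AS rn,
           COUNT(*)     OVER (PARTITION BY TransDate, Category)                    AS cnt
    FROM large_table
)
SELECT TransDate, Category,
       MIN(CASE WHEN rn >= 0.05 * cnt THEN Num_Trans END) AS P_LOWER_NUM,
       MIN(CASE WHEN rn >= 0.50 * cnt THEN Num_Trans END) AS P_MEDIAN_NUM,
       MIN(CASE WHEN rn >= 0.95 * cnt THEN Num_Trans END) AS P_UPPER_NUM
FROM ranked
GROUP BY TransDate, Category;
```

As a side benefit, this replaces SELECT DISTINCT over window-function output with a plain GROUP BY, which is usually cheaper for the optimizer.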

Related

Creating a table with percentiles from raw database

I have a database with 160 million or so records in it, segmented by a code called TMC. A TMC represents a section of highway that is measured every five minutes for speed and travel time, so a TMC is not a unique identifier: it repeats every five minutes for all days of the year for that one section of highway. There are 3,440 unique TMCs, and for each TMC I am trying to calculate a percentile of travel times for an entire year at a specific time of day.
I can get the code for the percentiles to work, but I do not understand how to create and update a table in SQL so the percentiles can be dumped and stored within it. The WITH statement being used to get the percentile does not mesh well with UPDATE statements. I normally just use SELECT, copy the data into Excel, and then re-import the data into my SQL database, but I am trying to automate this process as much as possible.
Here is the code I have so far:
create table TMCF5 (
    TMC_code varchar(50),
    P95M varchar(50),
    P50M varchar(50),
    P95A varchar(50),
    P50A varchar(50))
go
WITH PERCENTILES_Afternoon AS (
    SELECT TMC_code, EPOCH,
        percentile_CONT(.95) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float)) OVER (PARTITION BY TMC_code) AS P95afternoon,
        percentile_CONT(.50) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float)) OVER (PARTITION BY TMC_code) AS P50afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) between 16 and 17 AND (WKDAY != 'SAT' and WKDAY != 'SUN'))
insert tmcf5 (tmc_code) select tmc_code from percentiles_afternoon group by tmc_code
go
WITH PERCENTILES_Afternoon2 AS (
    SELECT TMC_code, EPOCH,
        percentile_CONT(.95) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float)) OVER (PARTITION BY TMC_code) AS P95afternoon,
        percentile_CONT(.50) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float)) OVER (PARTITION BY TMC_code) AS P50afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) between 16 and 17 AND (WKDAY != 'SAT' and WKDAY != 'SUN'))
update TMCF5 set tmcF5.p95A = percentiles_Afternoon2.P95Afternoon from percentiles_afternoon2
join percentiles_afternoon2 on tmcf5.tmc_code = percentiles_afternoon2.tmc_code
The error message you provided in the comments points to the problem in this statement:
update TMCF5 set tmcF5.p95A = percentiles_Afternoon2.P95Afternoon from percentiles_afternoon2
join percentiles_afternoon2 on tmcf5.tmc_code = percentiles_afternoon2.tmc_code
It has 'percentiles_afternoon2' listed as both tables. It seems you wanted to reference 'tmcf5' as one of your objects.
Also, if your first insert statement purely serves to bring in your tmc_codes, then just simplify the query to:
-- no CTEs needed
insert tmcf5 (tmc_code)
select distinct tmc_code
from AR_2018_TRUCKS_1_3;
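For completeness, here is a sketch of what the corrected UPDATE might look like, with TMCF5 joined to the CTE instead of the CTE joined to itself (the SELECT DISTINCT collapses the window-function output to one row per TMC_code; column and table names are taken from the question):

```sql
-- Sketch of the corrected UPDATE: target table joined to the CTE.
WITH PERCENTILES_Afternoon2 AS (
    SELECT DISTINCT TMC_code,
        percentile_CONT(.95) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float))
            OVER (PARTITION BY TMC_code) AS P95afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) between 16 and 17
      AND WKDAY NOT IN ('SAT', 'SUN')
)
UPDATE t
SET t.P95A = p.P95afternoon
FROM TMCF5 AS t
JOIN PERCENTILES_Afternoon2 AS p
  ON t.TMC_code = p.TMC_code;
```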

The field in ORDER BY affects the result of window functions

I have a simple T-SQL query which calculates a row number, row count and total volume across all records:
DECLARE @t TABLE
(
    id varchar(100),
    volume float,
    prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
    row_num    = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
    rows_count = COUNT(*)     OVER (PARTITION BY id ORDER BY prev_date),
    vol_total  = SUM(volume)  OVER (PARTITION BY id ORDER BY prev_date),
    *
FROM @t;
I get a result where rows_count is a running count (1, then 2) and vol_total is a running sum (200, then 300). However, this is NOT what I expected: in both rows, rows_count must be 2 and vol_total must be 300.
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
At the end of the day, I found out that the ORDER BY clause must use the id field rather than the prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change, the query's output is as expected.
But I don't understand why this is so. How come the ordering affects the partitioning?
For aggregate functions, an ORDER BY in the window definition is generally not required unless you want the aggregation done one row at a time in an ordered fashion, like a running total. Simply removing the ORDER BY will fix the problem.
To explain it another way: with an ORDER BY, it is like a window that expands row by row as you move on to the next row. It starts with the first row and calculates the aggregation over all the rows from the beginning (which for the first row is just the current row) up to the current row's position.
If you remove the ORDER BY, the aggregation is computed over all the rows in the window definition, and no order of applying the window takes effect.
You can change the ORDER BY in the window definition to see its effect.
Of course, ranking functions do need the ORDER BY; this point applies only to aggregations.
DECLARE @t TABLE
(
    id varchar(100),
    volume float,
    prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
    row_num    = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
    rows_count = COUNT(*)     OVER (PARTITION BY id),
    vol_total  = SUM(volume)  OVER (PARTITION BY id),
    *
FROM @t;
ORDER BY in the window definition for aggregate functions was added in SQL Server 2012; it was not part of the first release of window functions in SQL Server 2005.
For a detailed explanation of ORDER BY in window aggregates, this is a great help:
Producing a moving average and cumulative total - SQL Server documentation
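The behaviour described above can also be made explicit by spelling out the frame. When an ORDER BY is present, SQL Server's implicit frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW (RANGE additionally includes peer rows that tie on prev_date; the sketch below uses ROWS for a strict row-by-row running total). Run against the same @t table variable as above:

```sql
SELECT
    -- what ORDER BY implicitly gives you: a running total
    vol_running = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date
                                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW),
    -- no ORDER BY: the whole partition, i.e. the grand total per id
    vol_total   = SUM(volume) OVER (PARTITION BY id),
    *
FROM @t;
```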

Get the id of the row with the max value with two grouping

We have a data structure with four columns:
ContractorName, ProjectCode, InvoiceID, OrderID
We want to group the data by both the ContractorName and ProjectCode columns, and then get the InvoiceID of the row with MAX(OrderID) in each group.
You could use ROW_NUMBER:
SELECT ContractorName, ProjectName, OrderId, InvoiceId
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY ContractorName, ProjectName
ORDER BY OrderId DESC) AS rn
FROM tab
) AS sub
WHERE rn = 1;
ROW_NUMBER() is what I would call the canonical solution. In many cases, an old-fashioned solution has better performance:
select t.*
from t
where t.orderid = (select max(t2.orderid)
from t t2
where t2.contractorname = t.contractorname and
t2.projectname = t.projectname
);
This is especially true if there is an index on (contractorname, projectname, orderid).
Why is this faster? Basically, SQL Server can scan the table while doing a lookup in the index for each row. The lookup is really fast because the index is designed for it, so the scan is only a little slower than a plain full table scan.
When using row_number(), SQL Server has to scan the table to calculate the row number (and that can use the index, so it might be fast). But then it has to go back to the table to fetch the columns and apply the where clause. So, even if it uses an index, it is doing more work.
EDIT:
I should also point out that this can be done without a subquery:
select distinct contractorname, projectname,
       max(orderid) over (partition by contractorname, projectname) as latest_order,
       first_value(invoiceid) over (partition by contractorname, projectname order by orderid desc) as latest_invoice
from t;
Unfortunately, SQL Server doesn't offer FIRST_VALUE() as an aggregation function, but you can use SELECT DISTINCT and get the same effect.
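Another index-friendly pattern worth mentioning (not from the answers above, just a sketch under the same assumed (contractorname, projectname, orderid) index) is a TOP (1) seek per group via CROSS APPLY:

```sql
-- Sketch: one index seek per distinct group instead of numbering every row.
SELECT g.contractorname, g.projectname, x.orderid, x.invoiceid
FROM (SELECT DISTINCT contractorname, projectname FROM t) AS g
CROSS APPLY (
    SELECT TOP (1) t2.orderid, t2.invoiceid
    FROM t AS t2
    WHERE t2.contractorname = g.contractorname
      AND t2.projectname    = g.projectname
    ORDER BY t2.orderid DESC
) AS x;
```

With the index in place, each APPLY branch becomes a single backward range seek, which can beat both ROW_NUMBER() and the correlated-subquery form when groups are large.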

SQL Server loop for changing multiple values of different users

I have the following table:
Suppose these users are sorted in descending order by their inserted date.
Now, what I want to do is change their sorting numbers so that, for each user, the sorting number starts at 1 and runs up to the number of appearances of that user. The result should look something like:
Can someone give me some clues on how to do this in SQL Server? Thanks.
You can use the ROW_NUMBER ranking function to calculate a row's rank given a partition and order.
In this case, you want to calculate row numbers for each user (PARTITION BY User_ID). The desired output shows that ordering by Id is enough (ORDER BY Id).
SELECT
Id,
User_ID,
ROW_NUMBER() OVER (PARTITION BY User_ID ORDER BY Id) AS Sort_Number
FROM MyTable
There are other ranking functions you can use, e.g. RANK and DENSE_RANK to calculate a rank according to a score, or NTILE to assign each row to a bucket (e.g. NTILE(100) for percentiles).
You can also use the OVER clause with aggregates to create running totals or moving averages; e.g. SUM(Id) OVER (PARTITION BY User_ID ORDER BY Id) will create a running total of the Id values for each user.
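A combined sketch of the ranking and running-total ideas (MyTable and the column names are taken from the question):

```sql
SELECT
    Id,
    User_ID,
    ROW_NUMBER() OVER (PARTITION BY User_ID ORDER BY Id) AS Sort_Number,    -- 1, 2, 3, ... per user
    SUM(Id)      OVER (PARTITION BY User_ID ORDER BY Id) AS Running_Id_Sum  -- running total per user
FROM MyTable;
```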
Use ROW_NUMBER() with PARTITION BY User_Id:
SELECT
Id,
[User_Id],
Sort_Number = ROW_NUMBER() OVER(PARTITION BY [User_Id]
ORDER BY [User_Id],[CreatedDate] DESC)
FROM YourTable
select id
,user_id
,row_number() over(partition by user_id order by user_id) as sort_number
from table
Use the ranking function row_number():
select *,
row_number() over (partition by User_id order by user_id, date desc)
from table t

SQL Server Conditional Sort Performance Issue

I have a table with around 5 million rows. When I try a conditional sort on this table, it takes around 25 seconds, but when I change the conditional sort to a fixed sort criterion, it takes 1 second. The only difference is shown below:
-- takes 1 second
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
-- takes around 25 seconds
CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END AS RowNumber
Can anyone explain what is going on in SQL Server in this scenario?
OrderId must be indexed. Thus in the first instance:
ROW_NUMBER() OVER (ORDER BY OrderId DESC) AS RowNumber
SQL does not need to perform a sort having originally performed an index scan on the OrderId column. It knows that the index is ordered by the column you want to order by so does not need to perform another sort.
However, in the second example SQL has to evaluate CASE @SortColumn WHEN 'OrderId' THEN ROW_NUMBER() OVER (ORDER BY OrderId DESC) END for each row. Thus it performs a Compute Scalar operation on each row to work out the result of the CASE expression. The results of this operation cannot be mapped to an index, as they do not represent a column, so a further sort operation is required. Over 5 million rows this is a very expensive operation.
If you were to run the queries over non-indexed columns:
-- takes around 25 seconds
ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) AS RowNumber
-- also takes around 25 seconds
CASE @SortColumn WHEN 'NonIndexedColumn' THEN ROW_NUMBER() OVER (ORDER BY NonIndexedColumn DESC) END AS RowNumber
then both queries would presumably run equally slowly, as SQL would have to sort in both instances (not just use a pre-sorted index). Passing in a column to sort by will therefore always end up slow over a large number of rows if someone picks a non-indexed column. You need to ensure your results are filtered down to a manageable number of rows before the ORDER BY is applied.
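One common way around the per-row CASE (a sketch, not taken from the answer above; @SortColumn is from the question, while the table name Orders and the second column name are assumptions) is to branch before the query, so each branch gets its own plan and the indexed branch can still avoid the sort:

```sql
-- Each branch compiles to its own plan; the OrderId branch can use the index order.
IF @SortColumn = 'OrderId'
    SELECT o.*, ROW_NUMBER() OVER (ORDER BY o.OrderId DESC) AS RowNumber
    FROM Orders AS o;
ELSE
    SELECT o.*, ROW_NUMBER() OVER (ORDER BY o.NonIndexedColumn DESC) AS RowNumber
    FROM Orders AS o;
```

Dynamic SQL with a whitelisted column name achieves the same one-plan-per-sort-column effect when there are many sortable columns.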
