Creating a table with percentiles from a raw database - SQL Server

I have a database with about 160 million records in it, segmented by a code called TMC. A TMC represents a section of highway whose speed and travel time are measured every five minutes. So the TMC is not a unique identifier: it repeats every five minutes, for every day of the year, for that one section of highway. There are 3,440 unique TMCs, and for each TMC I am trying to calculate a percentile of travel times across an entire year for a specific time of day.
I can get the code for the percentiles to work, but I do not understand how to create and update a table in SQL so the percentiles can be dumped and stored in it. The WITH statement used to get the percentile does not seem to mesh well with UPDATE statements. I normally just SELECT the data, copy it into Excel, and then reimport it into my SQL database, but I am trying to automate this process as much as possible.
Here is the code I have so far.
create table TMCF5 (
    TMC_code varchar(50),
    P95M varchar(50),
    P50M varchar(50),
    P95A varchar(50),
    P50A varchar(50))
go
WITH PERCENTILES_Afternoon AS (
    SELECT TMC_code, EPOCH,
        percentile_CONT(.95) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float))
            OVER (PARTITION BY TMC_code) AS P95afternoon,
        percentile_CONT(.50) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float))
            OVER (PARTITION BY TMC_code) AS P50afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) between 16 and 17
      AND (WKDAY != 'SAT' and WKDAY != 'SUN'))
insert tmcf5 (tmc_code)
select tmc_code from percentiles_afternoon group by tmc_code
go
WITH PERCENTILES_Afternoon2 AS (
    SELECT TMC_code, EPOCH,
        percentile_CONT(.95) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float))
            OVER (PARTITION BY TMC_code) AS P95afternoon,
        percentile_CONT(.50) WITHIN GROUP (ORDER BY cast(travel_time_minutes as float))
            OVER (PARTITION BY TMC_code) AS P50afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) between 16 and 17
      AND (WKDAY != 'SAT' and WKDAY != 'SUN'))
update TMCF5 set tmcF5.p95A = percentiles_Afternoon2.P95Afternoon from percentiles_afternoon2
join percentiles_afternoon2 on tmcf5.tmc_code = percentiles_afternoon2.tmc_code

The error message you provided in the comments points to the problem being in this statement:
update TMCF5 set tmcF5.p95A = percentiles_Afternoon2.P95Afternoon from percentiles_afternoon2
join percentiles_afternoon2 on tmcf5.tmc_code = percentiles_afternoon2.tmc_code
It lists 'percentiles_afternoon2' as both tables in the join. It seems you meant to reference 'tmcf5' as one of the two objects.
Also, if your first INSERT statement purely serves the purpose of bringing in your tmc_codes, then simplify the query to:
-- no CTEs
insert tmcf5 (tmc_code)
select distinct tmc_code
from AR_2018_TRUCKS_1_3;
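Since PERCENTILE_CONT is a window (not grouped) function in SQL Server, you can also skip the separate INSERT/UPDATE steps entirely and load both percentiles in one pass by wrapping the CTE in SELECT DISTINCT. A sketch, assuming the table and column names from the question:

```sql
-- One-pass load: DISTINCT collapses the per-row window results
-- down to one row per TMC_code.
WITH percentiles_afternoon AS (
    SELECT DISTINCT
        TMC_code,
        PERCENTILE_CONT(.95) WITHIN GROUP (ORDER BY CAST(travel_time_minutes AS float))
            OVER (PARTITION BY TMC_code) AS P95afternoon,
        PERCENTILE_CONT(.50) WITHIN GROUP (ORDER BY CAST(travel_time_minutes AS float))
            OVER (PARTITION BY TMC_code) AS P50afternoon
    FROM [dbo].[AR_2018_TRUCKS_1_3]
    WHERE DATEPART(HOUR, EPOCH) BETWEEN 16 AND 17
      AND WKDAY NOT IN ('SAT', 'SUN')
)
INSERT INTO TMCF5 (TMC_code, P95A, P50A)
SELECT TMC_code, P95afternoon, P50afternoon
FROM percentiles_afternoon;
```

The morning columns (P95M, P50M) could then be filled by a similar UPDATE using a different DATEPART range.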


Example ID in aggregate queries SQL Server [closed]

Closed. This question needs details or clarity. It is not currently accepting answers. Closed 2 years ago.
I have a query that aggregates a large amount of transaction data. The raw data has unique IDs for every transaction and I need to have an example ID in each aggregated row. It doesn't matter which ID is chosen as long as the ID is intact so that we can go back and look up individual examples for a grouping from the raw transaction data if need be. I do not have control over the raw data.
For example, this:
ID                      Group
6457982468798542364879  Group 1
FR65487985412354        Group 1
1564879541356897        Group 2
6548941236584269        Group 2
Into this:
ExampleID               Group    Volume
6457982468798542364879  Group 1  2
1564879541356897        Group 2  2
I was trying to use MAX to do this, but that doesn't work when there are IDs that contain letters or are more than 20 characters long. I also tried STRING_AGG but kept hitting its character limit, and I only want a single ID for each group anyway.
The data sets are large so efficiency is a consideration. I'm using SQL Server version 2017.
If the example ID doesn't matter, pick an aggregate function like MIN() or MAX() and use that to show one ID from the group.
Assuming data as follows
CREATE TABLE #Test (ID nvarchar(30), Grp nvarchar(10))
INSERT INTO #Test (ID, Grp) VALUES
(N'6457982468798542364879' ,N'Group 1'),
(N'FR65487985412354' ,N'Group 1'),
(N'1564879541356897' ,N'Group 2'),
(N'6548941236584269' ,N'Group 2')
Update:
Based on the comments regarding MAX, you can just use it in one simple statement (don't try to CAST/CONVERT the ID to anything; just use MAX on its own):
SELECT Grp,
MAX(ID) AS ExampleID,
COUNT(*) AS Volume
FROM #Test
GROUP BY Grp
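For reference (an inferred result, not shown in the original post): MAX on the nvarchar ID column compares strings, and in the default collation letters sort after digits, so with the sample data above the query returns:

Grp      ExampleID         Volume
Group 1  FR65487985412354  2
Group 2  6548941236584269  2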
My original version (which used FIRST_VALUE to get an example row) needed a window function in a sub-query/CTE. Use the one above instead, as it's a lot clearer, easier to maintain, and probably uses less processing time. Here is the original, just for history's sake:
; WITH Src AS
(SELECT Grp,
FIRST_VALUE(ID) OVER (PARTITION BY Grp ORDER BY ID) AS exampleID
FROM #Test
)
SELECT Grp,
ExampleID,
COUNT(*) AS Volume
FROM Src
GROUP BY Grp,
ExampleID
It seems an efficient way would be to use two window functions: ROW_NUMBER() to eliminate duplicates, and COUNT() to get the Volume. Something like this:
Data
drop table if exists #Test;
go
create table #Test (
ExampleID nvarchar(30),
Grp nvarchar(10));
go
INSERT INTO #Test (ExampleID, Grp) VALUES
(N'6457982468798542364879', N'Group 1'),
(N'FR65487985412354', N'Group 1'),
(N'1564879541356897', N'Group 2'),
(N'6548941236584269', N'Group 2');
Query
with grp_cte as (
    select *,
        row_number() over (partition by Grp order by (select null)) as rn,
        count(*) over (partition by Grp) as cn
    from #Test)
select ExampleID, Grp as [Group], cn as Volume
from grp_cte
where rn = 1;
Output
ExampleID               Group    Volume
6457982468798542364879  Group 1  2
1564879541356897        Group 2  2
Just use an aggregate such as MIN on the ID, so the query is valid (a bare ID cannot appear in the SELECT list unless it is grouped or aggregated). This should do the job:
SELECT
    MIN(ID) AS ExampleID,
    [GROUP],
    COUNT(ID) AS VOLUME
FROM MyTable
GROUP BY [GROUP]

The field in ORDER BY affects the result of window functions

I have a simple T-SQL query that calculates the row number, row count, and total volume across all records:
DECLARE #t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO #t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date),
*
FROM #t;
I get the following result:

row_num  rows_count  vol_total  id          volume  prev_date
1        1           200        0318610084  200     2016-06-04
2        2           300        0318610084  100     2019-05-16

However, this is NOT what I expected: in both rows the rows_count must be 2 and vol_total must be 300.
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
At the end of the day I found out that the ORDER BY clause must use the id field rather than the prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change the query's output is as expected.
But I don't understand why this is so. How can the ordering affect the partitioning?
For aggregate functions, an ORDER BY in the window definition is generally not required unless you want the aggregation applied one row at a time in an ordered fashion, like a running total. Simply removing the ORDER BY fixes the problem.
To explain it another way: with an ORDER BY, the window expands row by row as you move from one row to the next. It starts with the first row and calculates the aggregate over all rows from the start of the partition up to the current row's position (which, for the first row, is just the current row itself).
If you remove the ORDER BY, the aggregation is computed over all the rows in the window definition, and no ordering is applied.
You can change the ORDER BY in the window definition to see its effect.
Of course, ranking functions require the ORDER BY; this point applies only to aggregates.
DECLARE #t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO #t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id),
vol_total = SUM(volume) OVER (PARTITION BY id),
*
FROM #t;
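If you want to keep ORDER BY prev_date (ROW_NUMBER needs it), the explicit-frame workaround the question mentions looks like this: widen the frame of each aggregate to cover the whole partition.

```sql
SELECT
    row_num    = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
    rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    vol_total  = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date
                    ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING),
    *
FROM #t;
```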
Support for ORDER BY in the window specification of aggregate functions was added in SQL Server 2012; it was not part of the first release of window functions in SQL Server 2005.
For a detailed explanation of ORDER BY in window aggregates, this is a great help:
Producing a moving average and cumulative total - SQL Server documentation

Using a running total calculated column in SQL Server table variable

I have inherited a stored procedure that uses a table variable to store data, then updates each row with a running-total calculation. The order of the records in the table variable is very important: we want the volume ordered highest to lowest (i.e. the running total grows increasingly larger as you go down the table).
My problem is that during the step where the table variable is updated, the running total is calculated, but not in the order the table variable was previously sorted in (descending by absolute volume).
DECLARE @TableVariable TABLE (ID int, Volume float, SortValue float, RunningTotal float)
DECLARE @TotalVolume float, @RunningTotal float
--Populate table variable and order by the sort value...
INSERT INTO @TableVariable (ID, Volume, SortValue)
SELECT
[ID], [Volume], ABS([Volume]) as SortValue
FROM
dbo.VolumeTable
ORDER BY
SortValue DESC
--Set TotalVolume variable...
SELECT @TotalVolume = ABS(SUM([Volume]))
FROM @TableVariable
--Calculate running total, update rows in table variable...I believe this is where the problem occurs?
SET @RunningTotal = 0
UPDATE @TableVariable
SET @RunningTotal = RunningTotal = @RunningTotal + [Volume]
--Output...
SELECT
ID, Volume, SortValue, RunningTotal
FROM
@TableVariable
ORDER BY
SortValue DESC
The result is that the record with the highest volume, which I would have expected the running total to calculate on first (so that running total = [Volume]), somehow ends up much further down the list. The running total seems to be calculated in a random order.
Here is what I would expect to get, versus what the code actually generates (screenshots omitted): the running total does not accumulate in descending volume order.
Is there a way to get the UPDATE statement to run against the table variable so that it is ordered by volume descending? From what I've read so far, it could be an issue with the sorting behavior of a table variable, but I'm not sure how to correct it. Can anyone help?
GarethD provided the definitive link to the multiple ways of calculating running totals and their performance. The correct one is both the simplest and fastest: 300 times faster than the quirky update. That's because it can take advantage of any indexes that cover the sort column, and because it's a lot simpler.
I repeat it here to make clear how much simpler this is when the database provides the appropriate window functions:
SELECT
[Date],
TicketCount,
SUM(TicketCount) OVER (ORDER BY [Date] RANGE UNBOUNDED PRECEDING)
FROM dbo.SpeedingTickets
ORDER BY [Date];
The SUM line means: sum all ticket counts over all (UNBOUNDED) the rows that came before (PRECEDING) the current one, when ordered by date.
That ends up being 300 times faster than the quirky update.
The equivalent query for VolumeTable would be:
SELECT
ID,
Volume,
ABS(Volume) as SortValue,
SUM(Volume) OVER (ORDER BY ABS(Volume) DESC RANGE UNBOUNDED PRECEDING)
FROM
VolumeTable
ORDER BY ABS(Volume) DESC
Note that this will be a lot faster if there is an index on the sort column (Volume) and ABS isn't applied to it. Applying a function to a column means the optimizer can't use any indexes that cover it, because the actual sort value differs from the one stored in the index.
If the table is very large and performance suffers, you could create a computed column and index it.
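A sketch of that computed-column idea (the column and index names here are made up, not from the original post):

```sql
-- Persist ABS(Volume) so it can be indexed and used for ordering.
ALTER TABLE dbo.VolumeTable
    ADD AbsVolume AS ABS(Volume) PERSISTED;

CREATE NONCLUSTERED INDEX IX_VolumeTable_AbsVolume
    ON dbo.VolumeTable (AbsVolume DESC)
    INCLUDE (ID, Volume);

-- The running total can then sort on the indexed column directly.
SELECT ID, Volume, AbsVolume AS SortValue,
       SUM(Volume) OVER (ORDER BY AbsVolume DESC ROWS UNBOUNDED PRECEDING) AS RunningTotal
FROM dbo.VolumeTable
ORDER BY AbsVolume DESC;
```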
Take a peek at the window functions offered in SQL Server.
For example:
Declare #YourTable table (ID int,Volume int)
Insert Into #YourTable values
(100,1306489),
(125,898426),
(150,907404)
Select ID
,Volume
,RunningTotal = sum(Volume) over (Order by Volume Desc)
From #YourTable
Order By Volume Desc
Returns
ID   Volume   RunningTotal
100  1306489  1306489
150  907404   2213893
125  898426   3112319
To be clear, the @YourTable table variable is for demonstration purposes only. There should be no need to INSERT your actual data into a table variable.
EDIT to support SQL Server 2008 (the good news is ROW_NUMBER() is supported in 2008):
Select ID
,Volume
,RowNr=Row_Number() over (Order by Volume Desc)
Into #Temp
From #YourTable
Select A.ID
,A.Volume
,RunningTotal = sum(B.Volume)
From #Temp A
Join #Temp B on (B.RowNr<=A.RowNr)
Group By A.ID,A.Volume
Order By A.Volume Desc

SQL running sum for an MVC application

I need a faster method to calculate and display a running sum.
It's an MVC Telerik grid that queries a view that generates a running sum using a sub-query. The query takes 73 seconds to complete, which is unacceptable. (Every time the user hits "Refresh Forecast Sheet", it takes 73 seconds to re-populate the grid.)
The query looks like this:
SELECT outside.EffectiveDate
[omitted for clarity]
,(
SELECT SUM(inside.Amount)
FROM vCI_UNIONALL inside
WHERE inside.EffectiveDate <= outside.EffectiveDate
) AS RunningBalance
[omitted for clarity]
FROM vCI_UNIONALL outside
"EffectiveDate" on certain items can change all the time... New items can get added, etc. I certainly need something that can calculate the running sum on the fly (when the Refresh button is hit). Stored proc or another View...? Please advise.
Solution (one of many; this one is orders of magnitude faster than a sub-query):
Create a new table with all the columns in the view except the RunningTotal column. Create a stored procedure that first truncates the table, then populates it with an INSERT INTO ... SELECT of all columns except the running-sum column.
Then use the local-variable update method:
DECLARE #Amount DECIMAL(18,4)
SET #Amount = 0
UPDATE TABLE_YOU_JUST_CREATED SET RunningTotal = #Amount, #Amount = #Amount + ISNULL(Amount,0)
Create a task agent that will run the stored procedure once a day. Use the TABLE_YOU_JUST_CREATED for all your reports.
Take a look at this post
Calculate a Running Total in SQL Server
If you have SQL Server Denali (2012), you can use the new windowed functions.
In SQL Server 2008 R2 I suggest you use a recursive common table expression (CTE).
A small problem with the CTE approach is that, for a fast query, you need an identity column without gaps (1, 2, 3, ...); if you don't have such a column, you have to create a temporary or variable table with one and move your data there.
The CTE approach will be something like this:
declare #Temp_Numbers (RowNum int, Amount <your type>, EffectiveDate datetime)
insert into #Temp_Numbers (RowNum, Amount, EffectiveDate)
select row_number() over (order by EffectiveDate), Amount, EffectiveDate
from vCI_UNIONALL
-- you can also use identity
-- declare #Temp_Numbers (RowNum int identity(1, 1), Amount <your type>, EffectiveDate datetime)
-- insert into #Temp_Numbers (Amount, EffectiveDate)
-- select Amount, EffectiveDate
-- from vCI_UNIONALL
-- order by EffectiveDate
;with
CTE_RunningTotal
as
(
select T.RowNum, T.EffectiveDate, T.Amount as Total_Amount
from #Temp_Numbers as T
where T.RowNum = 1
union all
select T.RowNum, T.EffectiveDate, T.Amount + C.Total_Amount as Total_Amount
from CTE_RunningTotal as C
inner join #Temp_Numbers as T on T.RowNum = C.RowNum + 1
)
select C.RowNum, C.EffectiveDate, C.Total_Amount
from CTE_RunningTotal as C
option (maxrecursion 0)
There may be some questions around duplicate EffectiveDate values; it depends on how you want to handle them - do you want them ordered arbitrarily, or do you want rows with the same date to share an equal running Amount?

MSSQL 2008 R2 Selecting rows within a certain range - Paging - What is the best way

Currently this SQL query is able to select between the rows I have determined, but is there a better approach?
select * from (select *, ROW_NUMBER() over (order by Id desc) as RowId
from tblUsersMessages ) dt
where RowId between 10 and 25
It depends on your indexes.
Sometimes this can be better:
SELECT *
FROM tblUsersMessages
WHERE Id IN (SELECT Id
FROM (select Id,
ROW_NUMBER() over (order by Id desc) as RowId
from tblUsersMessages) dt
WHERE RowId between 10 and 25)
This can be better if a narrower index exists that can be used to quickly find the Id values within the range.
You need to check the execution plans and the output of SET STATISTICS IO ON for your specific case.
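As a side note beyond the question's 2008 R2 constraint: on SQL Server 2012 and later, OFFSET/FETCH expresses the same paging directly, without the ROW_NUMBER() wrapper.

```sql
SELECT *
FROM tblUsersMessages
ORDER BY Id DESC
OFFSET 9 ROWS            -- skip rows 1-9
FETCH NEXT 16 ROWS ONLY; -- return rows 10-25 inclusive
```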
