How the NTILE (MS SQL) function ranks equal values - sql-server

I built score distributions for the parameters RECENCY (R), MONETARY (M), and FREQUENCY (F) using the NTILE function. Each parameter ranges from 1 upward with no upper bound.
My script:
SELECT
r.phone,
r.R,
r.F,
r.M,
NTILE(5) OVER (ORDER BY R desc) as R_S,
NTILE(5) OVER (ORDER BY F asc) as F_S,
NTILE(5) OVER (ORDER BY M asc) as M_S
FROM rfm_raw r
but I have a lot of equal parameter values, for example FREQUENCY = 1. I want to know: how does the NTILE function sort rows with the same FREQUENCY?
This image shows the variation of F_SCORE calculated by NTILE for rows with the same FREQUENCY. The F_SCORE changed between periods because different data was provided to the script, but for the smallest FREQUENCY the same ID had a bigger F_SCORE at 2022-05-01 than at 2022-04-01.
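When the ORDER BY expression produces ties, SQL Server gives no guarantee about which tied rows land in which NTILE bucket, so the assignment can change between runs as the data or the execution plan changes. A minimal sketch of one way to make the bucketing deterministic (assuming phone uniquely identifies a row, which is an assumption, not something stated in the question) is to add it as a tiebreaker:
SELECT
    r.phone,
    r.R, r.F, r.M,
    -- phone breaks ties on F, so tied rows land in the same bucket on every run
    NTILE(5) OVER (ORDER BY F ASC, r.phone ASC) AS F_S
FROM rfm_raw r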

Related

Computing AUC in SQL

What's the best way to compute AUC in SQL?
Here is what I got (assuming table T(label, confid) and label=0,1):
SELECT sum(cumneg * label) * 1e0 / (sum(label) * sum(1-label)) AS auc
FROM (
    SELECT label,
           sum(1-label) OVER (ORDER BY confid ROWS UNBOUNDED PRECEDING) (BIGINT) cumneg
    FROM T
) t;
I have to multiply by 1e0 in Teradata to get a real result, and the (BIGINT) cast (Teradata's shorthand cast syntax) is necessary to avoid overflow.
Here is a slightly different and maybe simpler solution I found:
SELECT (sum(label*r) - 0.5*sum(label)*(sum(label)+1)) / (sum(label) * sum(1-label)) AS auc
FROM (
    SELECT label, row_number() OVER (ORDER BY confid) r
    FROM T
) t;
that returns the same result as the query in the question.
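As a quick sanity check (a sketch with made-up sample data, not from the original answer), two positives and two negatives with one mis-ranked pair should give AUC = 3/4:
-- hypothetical test table; 3 of the 4 positive/negative pairs are ranked correctly
CREATE TABLE T (label int, confid float);
INSERT INTO T VALUES (0, 0.1), (1, 0.35), (0, 0.4), (1, 0.8);
-- both queries above return 0.75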
Update
This SQL query (like the one in the question) is non-deterministic when there are multiple examples with the same prediction (confid) but different labels. To compute a deterministic AUC using interpolation, the query can be modified as follows:
SELECT (sum(pos*r) - 0.5*sum(pos)*(sum(pos)+1) - 0.5*sum(pos*neg)) /
       (sum(pos) * sum(neg)) AS auc
FROM (
    SELECT pos, neg,
           sum(pos+neg) OVER (ORDER BY confid ROWS UNBOUNDED PRECEDING) r
    FROM (
        SELECT confid, sum(label) AS pos, sum(1-label) AS neg
        FROM T
        GROUP BY confid
    ) t
) t;
In the AUC formula, the denominator is the total number of pairs (positive x negative). The numerator computes how many of those pairs are ranked correctly. sum(pos*r) computes the total number of pairs so far (based on confidence order). That number includes positive x positive pairs, so the second term subtracts those. Finally, the last term subtracts half of the positive x negative pairs that share the same prediction.
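A small worked example (my own made-up data): labels 1,0,1,0 with confid 0.5, 0.5, 0.3, 0.7. Grouping by confid gives rows (0.3: pos=1, neg=0), (0.5: pos=1, neg=1), (0.7: pos=0, neg=1) with running counts r = 1, 3, 4. The numerator is sum(pos*r) - 0.5*sum(pos)*(sum(pos)+1) - 0.5*sum(pos*neg) = 4 - 3 - 0.5 = 0.5, and the denominator is 2 * 2 = 4, so AUC = 0.125. Counting pairs by hand agrees: of the four positive/negative pairs, only the tie at 0.5 scores (half a point), giving 0.5/4 = 0.125.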
The pseudo-SQL below takes advantage of the fact that ROC AUC equals the probability that the predicted score ranks a random positive example above a random negative one. It assumes that both labels have at least 10000 rows. The calculated AUC is not exact, but randomized. See also the same question for R.
WITH POSITIVE_SCORES AS (
    select score as p_pos
    from TABLE
    where label = positive
    order by rand()
    limit 10000
),
NEGATIVE_SCORES AS (
    select score as p_neg
    from TABLE
    where label = negative
    order by rand()
    limit 10000
)
select avg(case
               when p_pos > p_neg then 1
               when p_pos = p_neg then 0.5
               else 0
           end) as auc
from POSITIVE_SCORES
cross join NEGATIVE_SCORES
To calculate an exact, deterministic AUC score, aggregate by confid to handle cases where not all confidence values are unique, then compute the trapezoid area for each unique confidence value and sum them all. There is also an additional check for the case when all labels are zeros or ones. Note that the numeric type can overflow because of the multiplication; you can prevent that by casting to BIGINT.
MS SQL Implementation:
select
    -- summed trapezoid areas divided by the total number of positive/negative pairs;
    -- the outer IIF avoids division by zero when all labels are 0 or all are 1
    IIF(SUM(Ones) * SUM(Zeros) <> 0,
        SUM(IIF(Zeros * Ones > 0, 0.5 * Zeros * Ones + Height * Ones, Height * Ones))
            / (SUM(Ones) * SUM(Zeros)),
        0) as auc
from (
    select
        Zeros,
        Ones,
        -- running count of negatives from lower confidence values, as needed by rows
        -- containing positives; negatives tied with positives at the same confid are
        -- excluded here and credited as half-pairs above
        SUM(IIF(Zeros * Ones > 0, 0, Zeros)
            + IIF(PrevZeros * PrevOnes > 0, PrevZeros, 0)) OVER (ORDER BY PD) as Height
    from (
        select
            confid as PD,
            SUM(label) as Ones,
            SUM(ABS(1 - label)) as Zeros,
            LAG(SUM(label), 1, NULL) OVER (ORDER BY confid) as PrevOnes,
            LAG(SUM(ABS(1 - label)), 1, NULL) OVER (ORDER BY confid) as PrevZeros
        from T
        group by confid
    ) q1
) q2;
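Hand-tracing this against the worked example above (labels 1,0,1,0 with confid 0.5, 0.5, 0.3, 0.7) should also yield 0.125, matching the interpolated query.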

Should I use a cursor for this?

I have a table with three fields: group number, X-coord, and Y-coord. There can be from 0 to about 10 rows within each group number.
What I want to do is calculate the maximum and minimum distance between points within each group. Obviously, this will only give you a value if there are 2 or more rows within that group.
Output should consist of fields: group number, minDistance, maxDistance.
Is a cursor a good solution for this?
(Coordinates are in WGS84 and I have a working formula for calculating distances)
My reasoning for using a cursor is that I cannot avoid doing a cross join for each group and then applying the formula for each result of the cross join.
I wouldn't use a cursor in your situation, but rather an inline table-valued User Defined Function that takes the group number as an argument and calculates the minimum and maximum distance for that group inside the UDF.
Please note the distance calculation inside the function (plain Euclidean) is much simpler than the WGS84 formula you may have.
create table dist (groupId int, X int, Y int)
insert into dist(groupid, x, y) values (1,14,20),(1,11,20),(1,10,22),(1,12,24),(1,11,28),(1,19,78)
insert into dist(groupid, x, y) values (2,10,20),(2,11,20),(2,10,22),(2,12,24),(2,11,28),(2,17,52)
create function dbo.getMinMaxDistanceForGroup (@groupId int)
returns table as return (
    select MIN(SQRT(SQUARE(b.X - a.X) + SQUARE(b.Y - a.Y))) MinDistance,
           MAX(SQRT(SQUARE(b.X - a.X) + SQUARE(b.Y - a.Y))) MaxDistance
    from dist a cross join dist b
    where a.groupId = @groupId and b.groupId = @groupId
      -- exclude a point paired with itself, otherwise MinDistance is always 0
      and (a.X <> b.X or a.Y <> b.Y)
)
select groupId, MinDistance, MaxDistance
from dist OUTER APPLY dbo.getMinMaxDistanceForGroup(groupId)
group by groupid, MinDistance, MaxDistance
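A set-based alternative that skips the UDF entirely (a sketch against the same sample dist table; swap the Euclidean expression for your WGS84 formula):
select a.groupId,
       MIN(SQRT(SQUARE(b.X - a.X) + SQUARE(b.Y - a.Y))) as minDistance,
       MAX(SQRT(SQUARE(b.X - a.X) + SQUARE(b.Y - a.Y))) as maxDistance
from dist a
join dist b
  on a.groupId = b.groupId
 and (a.X <> b.X or a.Y <> b.Y) -- don't pair a point with itself
group by a.groupId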

Recursive Decaying Average in Sql Server 2012

I need to calculate a decaying average (cumulative moving?) of a set of values. The last value in the series gets 50% weight, and the decayed average of all prior values gets the other 50%, recursively.
I came up with a CTE query that produces correct results, but it depends on a sequential row number. I'm wondering if there is a better way to do this in SQL 2012, maybe with the new windowing functions for Over(), or something like that?
In the live data, the rows are ordered by time. I can use an SQL view and ROW_NUMBER() to generate the necessary Row field for my CTE approach, but if there is a more efficient way to do this, I would like to keep this as efficient as possible.
I have a sample table with 2 columns: Row int, and Value Float. I have 6 sample data values of 1,2,3,4,4,4. The correct result should be 3.78125.
My solution is:
;WITH items AS (
    SELECT TOP 1 Row, Value, Value AS Decayed
    FROM Sample ORDER BY Row
    UNION ALL
    SELECT v.Row, v.Value, itms.Decayed * .5 + v.Value * .5 AS Decayed
    FROM Sample v
    INNER JOIN items itms ON itms.Row = v.Row - 1
)
SELECT TOP 1 Decayed FROM items ORDER BY Row DESC
This correctly produces 3.78125 with the test data. My question is: Is there a more efficient and/or simpler way to do this in SQL 2012, or is this about the only way to do it? Thanks.
One possible alternative would be
WITH T AS
(
    SELECT Value * POWER(5E-1,
               ROW_NUMBER() OVER (ORDER BY Row DESC)
               /* first row decays less so special cased */
               - IIF(LEAD(Value) OVER (ORDER BY Row DESC) IS NULL, 1, 0)
           ) as x
    FROM Sample
)
SELECT SUM(x)
FROM T
Or, for the updated question using 60%/40% weights:
WITH T AS
(
    SELECT IIF(LEAD(Value) OVER (ORDER BY Row DESC) IS NULL, 1, 0.6)
           * Value
           * POWER(4E-1, ROW_NUMBER() OVER (ORDER BY Row DESC) - 1) as x
    FROM Sample
)
SELECT SUM(x)
FROM T
Both of the above make a single pass through the data and can potentially use an index on Row INCLUDE(Value) to avoid a sort.
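To see why the closed form is equivalent (a worked check on the sample data 1,2,3,4,4,4): unrolling the recursion d = 0.5 * d_prev + 0.5 * v gives each value a weight that halves with age, except the oldest value, which keeps the same weight as its neighbour. The sum is 4*0.5 + 4*0.25 + 4*0.125 + 3*0.0625 + 2*0.03125 + 1*0.03125 = 3.78125, matching the recursive CTE.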

How do I exclude outliers from an aggregate query?

I'm creating a report comparing total time and volume across units. Here is a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
You can exclude the top and bottom x percentiles with NTILE
SELECT m.Unit,
       COUNT(*) AS Count,
       SUM(m.TimeInMinutes) AS TotalTime
FROM
    (SELECT
         m.Unit,
         m.TimeInMinutes, -- needed by the outer SUM
         NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
     FROM
         main_table m
     WHERE
         m.unit <> '' AND m.TimeInMinutes > 0
    ) m
WHERE
    Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
One way would be to exclude the outliers with a not in clause:
where m.ID not in
(
select top 5 percent ID
from main_table
order by
TimeInMinutes desc
)
And another not in clause for the bottom five percent.
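Combined, that might look like this (a sketch, assuming ID is the table's key and keeping the original filters):
SELECT m.Unit,
       COUNT(*) AS Count,
       SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
  AND m.TimeInMinutes > 0
  AND m.ID NOT IN (SELECT TOP 5 PERCENT ID FROM main_table ORDER BY TimeInMinutes DESC)
  AND m.ID NOT IN (SELECT TOP 5 PERCENT ID FROM main_table ORDER BY TimeInMinutes ASC)
GROUP BY m.Unit
HAVING COUNT(*) > 15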
NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows instead of exactly the middle 90%. The suggestion to use TOP 95 PERCENT and then a reversed TOP 90 PERCENT is almost correct, except that 90% x 95% leaves you only 85.5% of the original dataset. So you would have to do
select top 94.7368 percent *
from (
    select top 95 percent *
    from main_table
    order by .. ASC
) X
order by .. DESC
First create a view to match your table column names
create view main_table
as
select type unit, number as timeinminutes from master..spt_values
Try this instead
select Unit, COUNT(*), SUM(TimeInMinutes)
FROM
(
    select *,
           ROW_NUMBER() over (order by TimeInMinutes) rn,
           COUNT(*) over () countRows
    from main_table
) N -- Numbered
where rn between countRows * 0.05 and countRows * 0.95
group by Unit, N.countRows * 0.05, N.countRows * 0.95
having count(*) > 20
The HAVING clause is applied to the remaining set after removing outliers.
For a dataset of 1,1,1,1,1,1,2,5,6,19, ROW_NUMBER assigns distinct positions to tied values, so the cut removes exactly the rows at each extreme, e.g. just one instance of the 1's rather than every tied 1.
I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.

SQL Filtering A Result Set To Return A Maximum Amount Of Rows At Even Intervals

I currently use SQL2008, where I have a stored procedure that fetches data from a table that is then fed into a line graph on the client. This procedure takes a from date and a to date as parameters to filter the data. This works fine for small datasets, but the graph gets a bit muddled when a large date range produces thousands of results.
What I'd like to do is provide a maximum number of records to be returned, taken at evenly spaced intervals. For example, say I limited it to 10 records and the result set was 100 records; I'd like the stored procedure to return every 10th record.
Is this possible without suffering big performance issues, and what would be the best way to achieve it? I'm struggling to find a way to do it without cursors, and if that's the case I'd rather not do it at all.
Thanks
Assuming you use at least SQL2005, you could do something like
WITH p as (
    SELECT a, b,
           row_number() OVER (ORDER BY time_column) as row_no,
           count(*) OVER () as total_count
    FROM myTable
    WHERE <date is in range>
)
SELECT a, b
FROM p
WHERE row_no % (total_count / 10) = 1
The WHERE condition at the bottom takes the row number modulo the total number of records divided by the required number of final records, so exactly one row out of each interval survives.
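For example, with 100 rows in the date range, total_count / 10 = 10 and row_no % 10 = 1 keeps rows 1, 11, 21, ..., 91, i.e. ten evenly spaced records.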
If you want to use the average instead of one specific value, you would extend this as follows:
WITH p as (
    SELECT a, b,
           row_number() OVER (ORDER BY time_column) as row_no,
           count(*) OVER () as total_count
    FROM myTable
    WHERE <date is in range>
),
a as (
    SELECT a, b, row_no, total_count,
           avg(a) OVER (partition by row_no / (total_count / 10)) as avg_a
    FROM p
)
SELECT a, b, avg_a
FROM a
WHERE row_no % (total_count / 10) = 1
The PARTITION BY clause reuses the formula from the final WHERE clause with % replaced by /, so each bucket of (total_count / 10) consecutive rows is averaged together and the selected row carries its bucket's average.
