Computing AUC in SQL

What's the best way to compute AUC in SQL?
Here is what I got (assuming table T(label, confid) and label=0,1):
SELECT sum(cumneg * label) * 1e0 / (sum(label) * sum(1-label)) AS auc
FROM (
SELECT label,
sum(1-label) OVER(ORDER BY confid ROWS UNBOUNDED PRECEDING) (BIGINT) cumneg
FROM T
) t;
I have to multiply by 1e0 in Teradata to get a floating-point result instead of integer division. The BIGINT cast is necessary to avoid overflow.

Here is a slightly different and maybe simpler solution I found:
SELECT (sum(label*r) - 0.5*sum(label)*(sum(label)+1)) / (sum(label) * sum(1-label)) AS auc
FROM (
SELECT label, row_number() OVER (ORDER BY confid) r
FROM T
) t;
It returns the same result as the query in the question.
Update
Both this query and the one in the question are non-deterministic when several examples share the same prediction (confid) but have different labels, because the order of tied rows is arbitrary. To compute a deterministic AUC that interpolates across ties, the query can be modified as follows:
SELECT (sum(pos*r) - 0.5*sum(pos)*(sum(pos)+1) - 0.5*sum(pos*(pos+neg-1))) /
(sum(pos) * sum(neg)) AS auc
FROM (
SELECT pos, neg,
sum(pos+neg) OVER (ORDER BY confid ROWS UNBOUNDED PRECEDING) r
FROM (
SELECT confid, sum(label) AS pos, sum(1-label) AS neg
FROM T
GROUP BY confid) t
) t;
In the AUC formula, the denominator is the total number of pairs (positive x negative). The numerator counts how many of those pairs are ranked correctly, with a tied pair counting as half. Here r is the number of rows with a confidence at or below the current one, so sum(pos*r) gives every positive the highest rank in its tie group. The last term, 0.5*sum(pos*(pos+neg-1)), moves each positive back to the average rank of its group, which is what makes a positive tied with a negative count as half a pair. The rank sum still counts each positive against itself and against the positives ranked below it, sum(pos)*(sum(pos)+1)/2 in total, so the second term subtracts those.
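As a cross-check outside the database, here is a minimal Python sketch of a tie-aware rank-sum AUC, compared against a brute-force count over all positive/negative pairs. The function names and the (label, confid) tuple layout are assumptions made for this illustration; tied rows receive the average rank of their group.

```python
from collections import defaultdict

def auc_tie_aware(rows):
    """rows: list of (label, confid) tuples with label in {0, 1}."""
    # Aggregate per confidence value, like GROUP BY confid.
    groups = defaultdict(lambda: [0, 0])  # confid -> [pos, neg]
    for label, confid in rows:
        groups[confid][0 if label == 1 else 1] += 1

    total_pos = total_neg = 0
    rank_sum = 0.0      # sum of average ranks of all positives
    rows_so_far = 0     # rows with a strictly lower confidence
    for confid in sorted(groups):
        pos, neg = groups[confid]
        first_rank = rows_so_far + 1
        last_rank = rows_so_far + pos + neg
        # Every row in a tie group gets the group's average rank.
        rank_sum += pos * (first_rank + last_rank) / 2
        rows_so_far = last_rank
        total_pos += pos
        total_neg += neg

    # Remove the part of the rank sum that comes from positives alone.
    correct_pairs = rank_sum - total_pos * (total_pos + 1) / 2
    return correct_pairs / (total_pos * total_neg)

def auc_brute_force(rows):
    """Reference: score every positive/negative pair, ties count 0.5."""
    pos = [c for label, c in rows if label == 1]
    neg = [c for label, c in rows if label == 0]
    score = sum(1.0 if p > n else 0.5 if p == n else 0.0
                for p in pos for n in neg)
    return score / (len(pos) * len(neg))
```

For example, `auc_tie_aware([(1, 0.9), (0, 0.9), (1, 0.8), (0, 0.1)])` and the brute-force version should agree, since both count the tied pair at 0.9 as half.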

The pseudo-SQL below takes advantage of the fact that ROC AUC equals the probability that the predicted score ranks a random positive example above a random negative one. It assumes that each label has at least 10000 rows. The calculated AUC is not exact but randomised, because it is estimated from a random sample. See also the same question for R.
WITH POSITIVE_SCORES AS (
select
score as p_pos
from
TABLE
where label = positive
order by rand()
limit 10000
),
NEGATIVE_SCORES AS (
select
score as p_neg
from
TABLE
where label = negative
order by rand()
limit 10000
)
select
avg(case
when p_pos > p_neg then 1
when p_pos = p_neg then 0.5
else 0
end) as auc
from
POSITIVE_SCORES
cross join
NEGATIVE_SCORES
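The same sampling idea can be sketched in Python. This is only an illustration under assumptions: the function name, the `sample_size` default (smaller than the 10000 used above, to keep the nested loop cheap) and the `seed` parameter are not part of the original answer.

```python
import random

def sampled_auc(pos_scores, neg_scores, sample_size=1000, seed=None):
    """Estimate AUC from random samples of positive and negative scores."""
    rng = random.Random(seed)
    # Sample without replacement, like ORDER BY rand() LIMIT n.
    pos = rng.sample(pos_scores, min(sample_size, len(pos_scores)))
    neg = rng.sample(neg_scores, min(sample_size, len(neg_scores)))

    total = 0.0
    for p in pos:            # cross join of the two samples
        for n in neg:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(pos) * len(neg))
```

Passing a fixed `seed` makes a run repeatable; without it, two runs on the same data will generally give slightly different estimates.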

To calculate an exact, deterministic AUC score, we should aggregate by "confid" to handle cases where the confidence values are not all unique. Then we calculate the trapezoid area for each unique confidence value and sum them all. There is also an additional check for the case where all labels are zeros or all are ones. Note that the multiplications can overflow the integer type; you can prevent this by casting to BIGINT.
MS SQL Implementation:
select
IIF(SUM(Ones) * SUM(Zeros) <> 0,
SUM(IIF(Zeros * Ones > 0, 0.5 * Zeros * Ones + Height * Ones, Height * Ones)) / (SUM(Ones) * SUM(Zeros)), 0)
from (
select
Zeros,
Ones,
SUM(IIF(Zeros * Ones > 0, 0, Zeros) + IIF(PrevZeros * PrevOnes > 0, PrevZeros, 0)) OVER (ORDER BY PD) as Height
from (
select
confid as PD,
SUM(label) as Ones,
SUM(ABS(1 - label)) as Zeros,
LAG(SUM(label), 1, NULL) OVER (ORDER BY confid) as PrevOnes,
LAG(SUM(ABS(1 - label)), 1, NULL) OVER (ORDER BY confid) as PrevZeros
from T
group by confid
) q1
) q2;
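For readers who want to check the trapezoid logic outside SQL Server, here is a rough Python sketch of the same idea. The function name and the (label, confid) tuple layout are assumptions for this example; like the query, it returns 0 when only one label is present.

```python
from collections import Counter

def auc_trapezoid(rows):
    """rows: list of (label, confid) tuples with label in {0, 1}."""
    ones, zeros = Counter(), Counter()
    for label, confid in rows:
        if label == 1:
            ones[confid] += 1
        else:
            zeros[confid] += 1

    total_ones = sum(ones.values())
    total_zeros = sum(zeros.values())
    if total_ones == 0 or total_zeros == 0:
        return 0.0  # AUC is undefined here; mirror the query's fallback

    area = 0.0
    height = 0  # zeros seen at strictly lower confidence values
    for confid in sorted(set(ones) | set(zeros)):
        o, z = ones[confid], zeros[confid]
        # Rectangle for zeros below, plus a triangle for tied zeros.
        area += height * o + 0.5 * z * o
        height += z
    return area / (total_ones * total_zeros)
```

Python integers do not overflow, so the BIGINT concern mentioned above only applies to the SQL version.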

Related

How the NTILE (MS SQL) function ranks equal values

I computed rating distributions for three parameters, RECENCY (R), MONETARY (M) and FREQUENCY (F), using the NTILE function. All parameters vary from 1 to positive infinity.
My script:
SELECT
r.phone,
r.R,
r.F,
r.M,
NTILE(5) OVER (ORDER BY R desc) as R_S,
NTILE(5) OVER (ORDER BY F asc) as F_S,
NTILE(5) OVER (ORDER BY M asc) as M_S
FROM rfm_raw r
But I have a lot of equal values, for example FREQUENCY = 1. How does the NTILE function order rows with the same FREQUENCY?
The image shows how F_SCORE, calculated by NTILE, varies for rows with the same FREQUENCY. F_SCORE changed between periods because different data was provided to the script, but with the smallest FREQUENCY the same ID had a bigger F_SCORE on 2022-05-01 than on 2022-04-01.
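As a rough illustration of why this happens, the sketch below mimics how NTILE hands out bucket numbers: purely by row position in the sorted order, with the larger buckets first when the row count does not divide evenly. The function name is hypothetical, and the sketch assumes the rows are already sorted.

```python
def ntile(n, sorted_rows):
    """Return the bucket number (1..n) for each row, in sorted order."""
    total = len(sorted_rows)
    base, extra = divmod(total, n)
    buckets = []
    for bucket in range(1, n + 1):
        # The first `extra` buckets get one additional row.
        size = base + (1 if bucket <= extra else 0)
        buckets.extend([bucket] * size)
    return buckets

frequencies = [1, 1, 1, 1, 1, 1, 2, 3, 5, 8]
scores = ntile(5, frequencies)
# The six rows with FREQUENCY = 1 are spread over buckets 1, 2 and 3.
```

Because the bucket depends only on position, which of the tied rows lands in which bucket depends on how the engine happens to order ties, and that order is not guaranteed to be stable between runs. Adding a unique tie-breaker column to the ORDER BY (for example the phone column, if it is unique) is one way to make the assignment repeatable.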

How to use recursive CTE to add resolution to a data set

I'm attempting to create a recursive CTE statement that adds blank rows in between data points, which will later be used for interpolation. I'm a beginner with SQL, this is my first time using CTEs, and I'm having some difficulty finding the proper way to do this.
I've attempted a few slight variations on the code below after some research, but I don't understand it well enough yet to see my issue. The following code should simulate sparse sampling by taking an observation every 4 hours from the sample data set. The second portion should add rows with their respective x values every 0.1 of an hour, which will later be filled with interpolated values derived from a cubic spline.
--Sample Data
create table #temperatures (hour integer, temperature double precision);
insert into #temperatures (hour, temperature) values
(0,18.5),
(1,16.9),
(2,15.3),
(3,14.1),
(4,13.8),
(5,14.7),
(6,14.7),
(7,13.5),
(8,12.2),
(9,11.4),
(10,10.9),
(11,10.5),
(12,12.3),
(13,16.4),
(14,22.3),
(15,27.2),
(16,31.1),
(17,34),
(18,35.6),
(19,33.1),
(20,25.1),
(21,21.3),
(22,22.3),
(23,20.3),
(24,18.4),
(25,16.8),
(26,15.6),
(27,15.4),
(28,14.7),
(29,14.1),
(30,14.2),
(31,14),
(32,13.9),
(33,13.9),
(34,13.6),
(35,13.1),
(36,15),
(37,18.2),
(38,21.8),
(39,24.1),
(40,25.7),
(41,29.9),
(42,28.9),
(43,31.7),
(44,29.4),
(45,30.7),
(46,29.9),
(47,27);
--1
WITH xy (x,y)
AS
(
SELECT TOP 12
CAST(hour AS double precision) AS x
,temperature AS y
FROM #temperatures
WHERE cast(hour as integer) % 4 = 0
)
Select x,y
INTO #xy
FROM xy
Select [x] As [x_input]
INTO #x_series
FROM #xy
--2
with recursive
, x_series(input_x) as (
select
min(x)
from
#xy
union all
select
input_x + 0.1
from
x_series
where
input_x + 0.1 < (select max(x) from x)
)
, x_coordinate as (
select
input_x
, max(x) over(order by input_x) as previous_x
from
x_series
left join
#xy on abs(x_series.input_x - xy.x) < 0.001
)
The first CTE works as expected and produces a list of 12 rows (a sample every 4 hours for two days), but the second produces a syntax error. The expected output would be something like
(4,13.8), (4.1,null/0), (4.2,null/0),....., (8,12.2)
I don't think you need recursion.
What about this:
SELECT DISTINCT n = number *1.0 /10 , #xy.x, #xy.y
FROM master..[spt_values] step
LEFT JOIN #xy
ON step.number*1.0 /10 = #xy.x
WHERE number BETWEEN 40 AND 480
This 480 is based on the two days you mention.
You don't even need the temp table:
SELECT DISTINCT n = number *1.0 /10 , #temperatures.temperature
FROM master..[spt_values] step
LEFT JOIN #temperatures
ON step.number *1.0 / 10 = #temperatures.hour
AND #temperatures.hour % 4 = 0
WHERE number BETWEEN 40 AND 480;
I don't think you need a recursive CTE here. I think a solution like this would be a better approach. Modify accordingly.
-- T-SQL variables are prefixed with @ (# denotes a temp table)
DECLARE @max_value FLOAT =
(SELECT MAX(hour) FROM #temperatures) * 10
INSERT INTO #temperatures (hour, temperature)
SELECT X.N / 10, NULL
FROM (
select CAST(ROW_NUMBER() over(order by t1.number) AS FLOAT) AS N
from master..spt_values t1
cross join master..spt_values t2
) X
WHERE X.N <= @max_value
AND X.N NOT IN (SELECT hour FROM #temperatures)
Using the temp table #xy that you produce in step --1, the following will give you an x series:
;with x_series(input_x)
as
(
select min(x) AS input_x
from #xy
union all
select input_x + 0.1
from x_series
where input_x + 0.1 < (select max(x) from #xy)
)
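SELECT * FROM x_series
OPTION (MAXRECURSION 0); -- the default limit of 100 recursion levels is too low for a 0.1 step over this range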
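To make the intended result concrete, here is a small Python sketch that builds the same kind of 0.1-spaced grid and attaches the known samples, leaving the other points empty for later interpolation. The function names are hypothetical. Unlike the recursive query, it derives each value from an integer counter rather than adding 0.1 repeatedly, which avoids accumulating floating-point drift, and it includes the upper endpoint.

```python
def x_series(x_min, x_max, steps_per_unit=10):
    """Grid from x_min to x_max (inclusive) in steps of 1/steps_per_unit."""
    count = int(round((x_max - x_min) * steps_per_unit))
    # Compute each point from the integer index to avoid drift.
    return [x_min + i / steps_per_unit for i in range(count + 1)]

def attach_known(series, known):
    """Pair each grid point with its sampled value, or None if unsampled."""
    return [(x, known.get(x)) for x in series]

grid = x_series(4, 8)
points = attach_known(grid, {4.0: 13.8, 8.0: 12.2})
```

Here `points` starts with the sample at x = 4, continues with `None` placeholders for the points in between, and ends with the sample at x = 8.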

Select top 10 percent, also bottom percent in SQL Server

I have two questions:
When using the select top 10 percent statement, for example on a test database with 100 scores, like this:
Select top 10 percent score
from test
Would SQL Server return the 10 highest scores, or just the first 10 rows in whatever order the data happens to be stored (e.g. if the data was entered so that the lowest scores come first, would this return the 10 lowest scores)?
I want to get the 10 highest scores and the 10 lowest scores out of these 100 scores. What should I do?
You could also use the NTILE window function to group your scores into 10 groups of data - group no. 1 would be the lowest 10%, group no. 10 would be the top 10%:
;WITH Percentile AS
(
SELECT
Score,
ScoreGroup = NTILE(10) OVER(ORDER BY Score)
FROM
test
)
SELECT *
FROM Percentile
WHERE ScoreGroup IN (1, 10)
Using a UNION ALL means that it will count all rows twice.
You can do it with a single count as below. Whether or not this will be more efficient will depend (e.g. on indexes).
WITH T
AS (SELECT *,
1E0 * ROW_NUMBER()
OVER (
ORDER BY score) / COUNT(*)
OVER() AS p
FROM test)
SELECT *
FROM T
WHERE p <= 0.1
OR p > 0.9
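The single-pass idea can be sketched in Python as follows. The function name and the `percent` parameter are assumptions for this example. The sketch compares integers instead of a floating-point ratio and treats the lower bound as inclusive, so that for 100 rows exactly 10 fall on each side.

```python
def top_and_bottom(scores, percent=10):
    """Keep the lowest and highest `percent` of scores, by row position."""
    ordered = sorted(scores)
    total = len(ordered)
    kept = []
    for row_number, score in enumerate(ordered, start=1):
        # row_number / total <= percent / 100, written without division
        in_bottom = row_number * 100 <= total * percent
        in_top = row_number * 100 > total * (100 - percent)
        if in_bottom or in_top:
            kept.append(score)
    return kept
```

For the scores 1 to 100, this keeps 1 to 10 and 91 to 100.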
select score from
(Select top 10 percent score
from test
order by score desc
)a
union all
select score from
(select top 10 percent score
from test
order by score asc
)b
If you do not want duplicate scores in the result, use UNION instead of UNION ALL.
Use ascending order in one query to get the bottom 10 percent and descending order in another to get the top 10 percent. Then union these two queries.

How do I exclude outliers from an aggregate query?

I'm creating a report comparing total time and volume across units. Here is a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
You can exclude the top and bottom x percentiles with NTILE
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM
(SELECT
m.Unit,
m.TimeInMinutes, -- needed by SUM(m.TimeInMinutes) in the outer query
NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
FROM
main_table m
WHERE
m.unit <> '' AND m.TimeInMinutes > 0
) m
WHERE
Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
Edit: this article has several techniques too
One way would be to exclude the outliers with a not in clause:
where m.ID not in
(
select top 5 percent ID
from main_table
order by
TimeInMinutes desc
)
And another not in clause for the bottom five percent.
NTILE is quite inexact. If you run NTILE against the sample view below, you will see that it catches an indeterminate number of rows instead of the central 90%. The suggestion to use TOP 95 PERCENT and then reverse with TOP 90 PERCENT is almost correct, except that 90% x 95% gives you only 85.5% of the original dataset. So you would have to do
select top 94.7368 percent *
from (
select top 95 percent *
from
order by .. ASC
) X
order by .. DESC
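The 94.7368 figure comes from dividing the share you want to end up with by the share that survives the first pass. A short Python sketch of that arithmetic (the function name is made up for this example):

```python
def second_pass_percent(first_keep, final_keep):
    """Percent to take in the second TOP so that final_keep percent
    of the original rows remain after keeping first_keep percent."""
    return 100.0 * final_keep / first_keep

naive_share = 0.95 * 0.90                    # about 0.855 of the rows
needed = second_pass_percent(95, 90)         # about 94.7368
```

Taking 94.7368 percent of the remaining 95 percent leaves roughly 90 percent of the original rows, with 5 percent trimmed from each end.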
First create a view to match your table column names
create view main_table
as
select type unit, number as timeinminutes from master..spt_values
Try this instead
select Unit, COUNT(*), SUM(TimeInMinutes)
FROM
(
select *,
ROW_NUMBER() over (order by TimeInMinutes) rn,
COUNT(*) over () countRows
from main_table
) N -- Numbered
where rn between countRows * 0.05 and countRows * 0.95
group by Unit, N.countRows * 0.05, N.countRows * 0.95
having count(*) > 20
The HAVING clause is applied to the remaining set after removing outliers.
For a dataset of 1,1,1,1,1,1,2,5,6,19, the use of ROW_NUMBER allows you to correctly remove just one instance of the 1's.
I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.

SQL Filtering A Result Set To Return A Maximum Amount Of Rows At Even Intervals

I currently use SQL2008, where I have a stored procedure that fetches data from a table that then gets fed into a line graph on the client. This procedure takes a from date and a to date as parameters to filter the data. This works fine for small datasets, but the graph gets a bit muddled when a large date range is entered and causes thousands of results.
What I'd like to do is provide a maximum number of records to be returned, and return records at evenly spaced intervals to give that number. For example, say I limited it to 10 records and the result set was 100 records; I'd like the stored procedure to return every 10th record.
Is this possible without suffering big performance issues, and what would be the best way to achieve it? I'm struggling to find a way to do it without cursors, and if that's the case I'd rather not do it at all.
Thanks
Assuming you use at least SQL2005, you could do something like
WITH p as (
SELECT a, b,
row_number() OVER(ORDER BY time_column) as row_no,
count(*) OVER() as total_count
FROM myTable
WHERE <date is in range>
)
SELECT a, b
FROM p
WHERE row_no % (total_count / 10) = 1
The WHERE condition at the bottom takes the row number modulo the total number of records divided by the required number of final records.
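The modulus filter can be sketched in Python like this. The function name and the `max_rows` parameter are assumptions. The guard for small inputs is an addition that is not in the query above: with fewer than twice `max_rows` rows the integer step would be 0 or 1, and the modulus test would fail or match nothing.

```python
def thin_rows(rows, max_rows=10):
    """Return roughly max_rows evenly spaced rows from an ordered list."""
    total_count = len(rows)
    step = total_count // max_rows          # integer division, as in SQL
    if step <= 1:
        # Assumption: with so few rows, just return them all.
        return list(rows)
    # Row numbers start at 1, like row_number().
    return [row for row_no, row in enumerate(rows, start=1)
            if row_no % step == 1]
```

For 100 rows this returns rows 1, 11, 21 and so on up to 91. When the row count is not a multiple of `max_rows`, the result can contain a few more rows than requested.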
If you want to use the average instead of one specific value, you would extend this as follows:
WITH p as (
SELECT a, b,
row_number() OVER(ORDER BY time_column) as row_no,
count(*) OVER() as total_count
FROM myTable
WHERE <date is in range>
),
a as (
SELECT a, b, row_no, total_count,
avg(a) OVER(partition by row_no / (total_count / 10)) as avg_a
FROM p
)
SELECT a, b, avg_a
FROM a
WHERE row_no % (total_count / 10) = 1
The PARTITION BY clause reuses the formula from the final WHERE clause, with % replaced by /, so that the rows are grouped into intervals of the same size.
