Calculate Percentile Rank using NTILE? - sql-server

Need to calculate the percentile rank (1st - 99th percentile) for each student with a score for a single test.
I'm a little confused by the MSDN definition of NTILE, because it does not explicitly mention percentile rank. I need some assurance that NTILE is the correct function to use for calculating percentile rank.
declare @temp table
(
StudentId int,
Score int
)
insert into @temp
select 1, 20
union
select 2, 25
.....
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
It looks correct to me, but is this the correct way to calculate percentile rank?

NTILE is absolutely NOT the same as percentile rank. NTILE simply divides a set of data evenly by the number of buckets provided (as noted by RoyiNamir above). If you chart the results of both functions, NTILE will be a perfectly linear line from 1 to n, whereas percentile rank will [usually] have some curves to it depending on your data.
Percentile rank is much more complicated than simply dividing the rows into N even groups. It takes each row's value, figures out where in the distribution it lies, and interpolates when necessary (which is very CPU intensive). I have an Excel sheet of 525,000 rows and it pins my 8-core machine's CPU at 100% for 15-20 minutes just to figure out the PERCENTRANK function for a single column.

One way to think of this is, "the percentage of Students with Scores below this one."
Here is one way to get that type of percentile in SQL Server, using RANK():
select *
, (rank() over (order by Score) - 1.0) / (select count(*) from @temp) * 100 as PercentileRank
from @temp
Note that this will always be less than 100% unless you round up, and you will always get 0% for the lowest value(s). This does not necessarily put the median value at 50%, nor will it interpolate like some percentile calculations do.
Feel free to round or cast the whole expression (e.g. cast(... as decimal(4,2))) for good looking reports, or even replace - 1.0 with - 1e to force floating point calculation.
NTILE() isn't really what you're looking for in this case because it essentially divides the row numbers of an ordered set into groups rather than the values. It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point. You'd have to then additionally group by that value and grab the max or min percentile of the group to use NTILE() in the same way as we're doing with RANK().
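To see the tie problem concretely, here is a minimal sketch (data made up) that puts two equal scores on either side of an NTILE boundary; the RANK-based expression keeps the tie together, while NTILE splits it:
declare @scores table (StudentId int, Score int)
insert into @scores values (1, 10), (2, 20), (3, 20), (4, 30)
select StudentId, Score
, ntile(2) over (order by Score) as Bucket -- the tied 20s land in buckets 1 and 2
, (rank() over (order by Score) - 1.0) / (select count(*) from @scores) * 100 as PercentileRank -- the tied 20s both get 25
from @scores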

Is there a typo?
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
And your script looks good. If you think something is wrong there, could you clarify what exactly?

There is an issue with your code, as the NTILE distribution is not uniform. If you have 213 students, the first 13 groups would have 3 students each and the remaining 87 would have 2 students each. This is not what you would ideally want in a percentile distribution.
You might want to use RANK/ROW_NUMBER and then divide to get the %ile group; see the sketch below.
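A quick way to verify the uneven bucket sizes is a sketch like this (it fabricates 213 dummy rows from a system view):
with n as
(
select top (213) row_number() over (order by (select null)) as id
from sys.all_objects
), b as
(
select ntile(100) over (order by id) as bucket from n
)
select bucket, count(*) as students_in_bucket -- buckets 1-13 hold 3 rows, 14-100 hold 2
from b
group by bucket
order by bucket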

I know this is an old thread but there's certainly a lot of misinformation about this topic making its way around the internet.
NTILE is not designed for calculating percentile rank (AKA percent rank).
If you are using NTILE to calculate percent rank you are doing it wrong. Anyone who tells you otherwise is misinformed and mistaken. If you are using NTILE(100) and getting the correct answer, it's purely coincidental.
Tim Lehner explained the problem perfectly.
"It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point."
In other words, using NTILE to calculate where students rank based on their test scores can result in two students with the exact same test scores receiving different percent rank values. Conversely, two students with different scores can receive the same percent rank.
For a more verbose explanation of why NTILE is the wrong tool for this job, as well as a profoundly better performing alternative to PERCENT_RANK, see: Nasty Fast PERCENT_RANK.
http://www.sqlservercentral.com/articles/PERCENT_RANK/141532/
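For what it's worth, SQL Server 2012 and later also ship a built-in PERCENT_RANK window function; here is a minimal sketch against the @temp variable from the original question:
select StudentId, Score
, percent_rank() over (order by Score) * 100 as PercentRank -- computed as (rank - 1) / (rows - 1)
from @temp
Note the denominator is rows - 1, so the highest score gets exactly 100, unlike the rank() / count(*) version above.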

Related

Table join with multiple conditions

I'm having trouble writing the conditions for joining my tables. The highlighted parts are the 3 conditions that I need to solve. Basically, there are some securities where, for their effective term, a value between 0 and 2 gets score 1, a value between 2 and 10 gets score 2, and a value bigger than 10 gets score 4.
For the first two conditions, I solve them in the query's WHERE clause like this;
however, for the third condition, if the Descriptsec is empty I'm not quite sure what I can do. Can anyone help?
Can you change the lookup table ([Risk].[dbo].[FILiquidityBuckets]) you are using?
If yes, do this:
Add bounds so that the table looks like this:
Metric         | DescriptLowerBound | DescriptUpperBound | LiquidityScore
Effective term | 0                  | 2                  | 1
Effective term | 2                  | 10                 | 2
Effective term | 10                 | 9999999 (some absurdly high number) | 4
Then your join condition can be this:
ON FB3.Metric = 'Effective term'
AND CAST(sa.effectiveTerm AS INT) BETWEEN CAST(FB3.DescriptLowerBound AS INT)
AND CAST(FB3.DescriptUpperBound AS INT)
Please note that BETWEEN is inclusive, so in the edge cases (where the value is exactly 2 or 10) both buckets match and the join can return two scores.
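If you'd rather avoid that double match, a half-open range works too (a sketch, assuming the bound columns cast cleanly to INT):
ON FB3.Metric = 'Effective term'
AND CAST(sa.effectiveTerm AS INT) >= CAST(FB3.DescriptLowerBound AS INT)
AND CAST(sa.effectiveTerm AS INT) < CAST(FB3.DescriptUpperBound AS INT)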
I can see some other problems: the effective term in the table aliased sa is a float, so you should consider rounding up or down.
Overall, a lot of things could be changed or improved, but I tried to offer an immediate solution.
Hope this helps.

How do I name columns that contain restricted characters so that the name is intuitive?

I have to make database columns that store the number of people who scored greater than or equal to thresholds of 0.01, 0.1, 1, and 10 at a particular metric I'm tracking.
For example, I want to store the following data
Number units quality score >= 0.01
Number units quality score >= 0.1
Number units quality score >= 1
Number units quality score >= 10
So my question is...
How do I name my columns to communicate this? I can't use the >= symbol or the . character. I was thinking something like...
gte_score_001
gte_score_01
gte_score_1
gte_score_10
Does this seem intuitive? Is there a better way to do this?
It's subjective, but I'd say score_gte_001 would be slightly more intuitive. meets_thresh_001 would be another option that may be slightly clearer than gte.
Then there's the numbers. Avoid the decimal point problem by referring to the numbers either explicitly or implicitly as hundredths:
meets_thresh_1c
meets_thresh_10c
meets_thresh_100c
meets_thresh_1000c
Your examples are all either integers or less than one, so there isn't any ambiguity; but that itself might not be obvious to someone else looking at the name, and you could have to add more confusing or even conflicting groups. What would gte_score_15 mean? It could be read as >= 15 or >= 1.5, and you might need to represent both one day, so your naming should try to be future-proof as well as intuitive.
Including a delimiter to show where the decimal point goes would make it clearer, at least once you know the scheme. To me it makes sense to use the numeric format model character for the decimal separator, D:
gte_score_0d01
gte_score_0d1
gte_score_1
gte_score_1d5
gte_score_10
gte_score_15
though I agree with @L.ScottJohnson that score_gte_0d01 etc. scans better. Again, it's subjective.
If there is a maximum value for the metric, and a maximum precision, it might be worth including leading and trailing zeros. Say the value can never be more than two digits, with no more than two decimal places:
score_gte_00d01
score_gte_00d10
score_gte_01d00
score_gte_01d50
score_gte_10d00
score_gte_15d00
The delimiter is kind of redundant as long as you know the pattern - without it, the threshold is the numeric part / 100. But it's maybe clearer with it, and with the padding it's probably even more obvious what the d represents.
If you go down this route then I'd suggest you come up with a scheme that makes sense to you, then show it to colleagues and see if they can interpret it without any hints.
You could (and arguably should) normalise your design instead, into a separate table which has a column for the threshold (which can then be a simple number) and another column for the corresponding number of people for each metric. That makes it easier to add more thresholds later (it's easier to add extra rows than an extra column) and makes the naming problem go away. (You could add a view to pivot into this layout, but then you're back to your naming problem.)
Do you really want to do that? What if it turns out that you need an additional threshold of 20? Or 75? Or ...? Why wouldn't you normalize it and use two columns: one that represents a threshold, and another that represents the number of people?
So, instead of
create table some_table
(<some_columns_here>,
gte_score_001 number,
gte_score_01 number,
gte_score_1 number,
gte_score_10 number
);
use
create table some_table
(<some_columns_here>,
threshold number,
number_of_people number
);
comment on column some_table.number_of_people is
'Represents number of people who scored greater than or equal to a threshold';
and store values like
insert into some_table (threshold, number_of_people) values (0.01, 13);
insert into some_table (threshold, number_of_people) values (0.1, 56);
insert into some_table (threshold, number_of_people) values (1, 7);
If you have to implement a new threshold value, no problem - just insert it as
insert into some_table (threshold, number_of_people) values (75, 24);
Doing your way, you'd have to
alter table some_table add gte_score_75 number;
and - what's even worse - modify all other program units that reference that table (stored procedures, views, forms, reports ... the list is quite long).
Storing the number of people above a threshold will cause a problem no matter which of these solutions you use. Instead, use a view over the raw data. If you need a new split, you create a new view; you don't have to recalculate any columns. This example is not efficient, but is provided as an illustration:
create or replace view thresholds as
select
(select count(*) c from rawdata where score >= .1) as ".1"
, (select count(*) c from rawdata where score >= 1 ) as "1"
, (select count(*) c from rawdata where score >= 10 ) as "10"
from dual
Double quotes around the column name alias will let you get away with a lot of things. That removes most limitations on restricted characters and keywords.
I suggest something like this ought to work for display purposes.
SELECT 314 "Number units quality score >= 0.01" FROM DUAL

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a QUARTILE wrapper, and it only kind of works: it returns values that are slightly off from what I would expect if I calculate the range of values manually.
I've looked around, but most of the posts and research are about designing the formula; I haven't come across anything compelling in terms of this odd behaviour I'm observing.
My formula (entered with Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met and should return range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated over this range.
If I take the output from the formula, the 25th percentile (QUARTILE, 1) is 0.8803, but if I calculate it manually based on the data points right above, it comes out to 0.8685, and I can't see why.
I feel it's because the IF statements identify a slightly different range, or the values that meet the IF statements are from different rows or something.
If you look at the table here you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must be like QUARTILE.EXC, and the one you are using in the formula is like QUARTILE.INC.
Basically, both formulas work out the rank of the quartile value. If it isn't an integer, they interpolate (e.g. if it was 1.5, the quartile lies half way between the first and second numbers in ascending order). You might think that there wouldn't be much difference, but for small samples there is a massive difference:
QUARTILE.EXC: rank = (N+1)/4
QUARTILE.INC: rank = (N+3)/4
Here's how it would look with your data
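For instance, with the six matching values sorted ascending (0.867040346, 0.868997877, 0.914032128, 0.981207615, 0.984750004, 0.988983643), N = 6:
QUARTILE.EXC: rank = (6+1)/4 = 1.75, so interpolate 75% of the way from the 1st to the 2nd value: 0.867040346 + 0.75 × (0.868997877 − 0.867040346) ≈ 0.8685
QUARTILE.INC: rank = (6+3)/4 = 2.25, so interpolate 25% of the way from the 2nd to the 3rd value: 0.868997877 + 0.25 × (0.914032128 − 0.868997877) ≈ 0.8803
which reproduces both your hand calculation and the formula's output.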

TSQL - Do you calculate values then sum, or sum first then calculate values?

I feel stupid asking this - there is probably a math rule I am forgetting.
I am trying to calculate a gross profit based on net sales, cost, and billbacks.
I get two different values based on how I do the calculation:
(sum(netsales) - sum(cost)) + sum(billbackdollars) as CalculateOutsideSum,
sum((netsales - cost) + BillBackDollars) as CalculateWithinSum
This is coming off of a basic transaction fact table.
In this particular example, there are about 90 records being summed, and I get the following results
CalculateOutsideSum: 234.77
CalculateWithinSum: 247.70
I imagined this would be some sort of transitive property and both results would be the same considering it's just summation.
Which method is correct?
From a mathematical point of view, you should get exactly the same value with both formulas.
Anyway, in these cases it's better to perform the sum after any calculation.
EDIT AFTER OP'S RESPONSE:
Also, treat your data with the ISNULL function, or other conversions that increase data precision, before summing.
Rounding, formatting, and casts that decrease data precision should be applied after the sums.
Just figured it out...
The problem was that Net Sales was NULL for 3 rows, causing the row-level calculation to become NULL and those rows to drop out of the sum. After adding an ISNULL, both sums come out the same.
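Here is a minimal repro of the effect (table and figures made up): SUM skips NULLs, but a NULL inside the row-level expression wipes out the whole row.
declare @t table (netsales money, cost money, billbackdollars money)
insert into @t values (100, 40, 5), (null, 10, 2), (50, 20, 3)
select
(sum(netsales) - sum(cost)) + sum(billbackdollars) as CalculateOutsideSum -- (150 - 70) + 10 = 90
, sum((netsales - cost) + billbackdollars) as CalculateWithinSum -- 65 + NULL + 33 = 98
, sum((isnull(netsales, 0) - cost) + billbackdollars) as CalculateWithinIsnull -- 65 - 8 + 33 = 90
from @t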

Calculating Moving Range in SQL Server (without arrays)

I have a requirement to calculate the Moving Range of a load of data (at least I think this is what it is called) in SQL Server. This would be easy if I could use arrays, but I understand this is not possible in MS SQL, so I wonder if anyone has a suggestion.
To give you an idea of what I need:
Lets say I have the following in a sql server table:
1
3
2
6
3
I need to get the difference of each of these numbers (in order), ie:
|1-3|=2
|3-2|=1
|6-2|=4
|3-6|=3
Then square these:
2^2=4
1^2=1
4^2=16
3^2=9
EDIT: PROBABLY WORTH NOTING THAT YOU DO NOT SQUARE THESE FOR MOVING RANGE - I WAS WRONG
Then sum them:
4+1+16+9=30
Then divide by number of values:
30/5=6
Then square root this:
2.5(ish)
EDIT: BECAUSE YOU ARENT SQUARING THEM, YOU ARENT SQROOTING THEM EITHER
If anyone can just help me out with the first step, that would be great - I can do the rest myself.
A few other things to take into account:
- Using stored procedure in SQL Server
- There is quite a lot of data (100s or 1000s of values), and they will need to be calculated daily or weekly
Many thanks in advance.
~Bob
WITH nums AS
(
-- Number the rows so that consecutive values can be paired up
SELECT num, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM mytable
)
SELECT SQRT(AVG(POWER(CAST(tp.num - tf.num AS float), 2))) -- cast to float so AVG doesn't truncate to an integer
FROM nums tp
JOIN nums tf
ON tf.rn = tp.rn + 1
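On SQL Server 2012 or later, LAG avoids the self-join; here is a sketch (assuming the same id column defines the order, and skipping the square/square-root per the question's edit):
WITH diffs AS
(
SELECT ABS(num - LAG(num) OVER (ORDER BY id)) AS mr
FROM mytable
)
SELECT AVG(1e * mr) AS avg_moving_range -- 1e forces floating point division
FROM diffs
WHERE mr IS NOT NULL -- the first row has no predecessor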
