Table join with multiple conditions - sql-server

I'm having trouble writing the join condition for these tables. The highlighted parts are the 3 conditions I need to solve. Basically, some securities are scored by their effective term: if the value is between 0 and 2 the score is 1, if it is between 2 and 10 the score is 2, and if it is greater than 10 the score is 4.
For the first two conditions, I solve them in the query's WHERE part like this;
however, for the third condition, where Descriptsec is empty, I'm not quite sure what I can do. Can anyone help?

Can you change the lookup table ([Risk].[dbo].[FILiquidityBuckets]) you are using?
If yes, do this:
Add bounds so that the table looks like this:
Metric | DescriptLowerBound | DescriptUpperBound | LiquidityScore
Effective term | 0 | 2 | 1
Effective term | 2 | 10 | 2
Effective term | 10 | 9999999 (some absurdly high number) | 4
Then your join condition can be this:
ON FB3.Metric='Effective term'
AND CAST(sa.effectiveTerm AS INT) BETWEEN CAST(FB3.DescriptLowerBound AS INT)
AND CAST(FB3.DescriptUpperBound AS INT)
Please note that BETWEEN is inclusive, so in the edge cases (where the value is exactly 2 or 10) the value matches two buckets and the join will return both scores; consider making the upper bound exclusive (lower <= value AND value < upper) to avoid duplicate rows.
I can see one more problem: the effective term in the table with the sa alias is a float, so you should consider rounding up or down before casting to INT.
Overall a lot of things can be changed/improved but I tried to offer an immediate solution.
Hope this helps.
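To make the suggestion concrete, here is a minimal sketch of the amended lookup table and the join, using half-open intervals to avoid the double match at the edges. Table and column names are taken from the question; the source table name (securities) and sample data are assumptions:

```sql
-- Assumed shape of the amended lookup table
CREATE TABLE [Risk].[dbo].[FILiquidityBuckets] (
    Metric             varchar(50),
    DescriptLowerBound int,
    DescriptUpperBound int,
    LiquidityScore     int
);

INSERT INTO [Risk].[dbo].[FILiquidityBuckets] VALUES
    ('Effective term', 0,  2,       1),
    ('Effective term', 2,  10,      2),
    ('Effective term', 10, 9999999, 4);

-- Half-open intervals (>= lower, < upper) so 2 and 10 match exactly one bucket
SELECT sa.effectiveTerm, FB3.LiquidityScore
FROM securities sa                      -- hypothetical source table
JOIN [Risk].[dbo].[FILiquidityBuckets] FB3
  ON  FB3.Metric = 'Effective term'
  AND sa.effectiveTerm >= FB3.DescriptLowerBound
  AND sa.effectiveTerm <  FB3.DescriptUpperBound;
```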

How to get correct values without checking the datatype of each column

I am writing lots of queries to use in reports and statistics,
often I need average values, so I use avg(column) but that has a problem
look at this example
declare @table table (value int)
insert into @table values (2), (3)
select avg(t.value),
avg(convert(decimal(16,2), t.value)),
avg(t.value/1.0)
from @table t
the result will be
2 2.5 2.5
The first value, 2, is wrong: the average of 2 and 3 is not 2 but 2.5.
Now I know this is because the column value is of type int, and because of that the result is also returned as an int.
But that just is not practical. I never have any use for that; I always need to convert. I cannot think of any case where I want the avg of 2 and 3 to be 2.
So here is my problem:
I have many queries to write on lots of tables and columns. If I need to check the datatype of each column and write a convert where needed, that not only takes a lot of time, it is also very dangerous: it is so easy to forget one, which would introduce a bug in the report.
Is there a way to write this without having to convert first?
Do I always have to write value/1.0 no matter what the datatype is?
This is something that is so easy to forget, and mistakes are made so fast.
Also, can this affect performance if I always have to divide all values before calling a function?
What is the best way of doing this ?
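One common pattern (a sketch, not something from the original post) is to cast inside the aggregate, so the column's declared type no longer matters; this is essentially the question's own convert() variant written once per aggregate rather than per column type check:

```sql
declare @table table (value int)
insert into @table values (2), (3)

-- Casting inside AVG forces decimal arithmetic regardless of the column type
select avg(cast(t.value as decimal(16,2))) as avg_value  -- returns 2.5
from @table t
```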

I wish to put this condition in a more sargable way

In my SQL Server 2017 Standard CU20 instance there is a very often executed query that has this awful condition:
AND F.progressive_invoice % @numberofservicesInstalled = @idService
Is there a mathematical way to rewrite it in a form more convenient for SQL Server?
The query takes from 240 to 500 ms.
Can you help me do better? Please.
What do you think is particularly awful here?
Is this query performing badly? Are you sure that this condition is responsible?
This % is the modulo operator.
SELECT 13 % 5 => Remainder is 3
--This is roughly what your code is doing:
DECLARE @Divisor INT=5; --Switch this value
DECLARE @CompareRemainder INT=3;
SELECT CASE WHEN 13 % @Divisor = @CompareRemainder THEN 'Remainder matches variable' ELSE 'no match' END;
Your line of code tells the engine to compute an integer division of F.progressive_invoice by the variable @numberofservicesInstalled and take the remainder. The result of this computation is compared to the variable @idService.
As this computation must be done for every row, an index will not help here...
I do not think that this can be made more sargable.
UPDATE
In a comment you suggest that it might help to change the code on the right of the equals operator. No, this will not help.
I tried to think of a meaningful interpretation of this... Is the number of a service (or, as the variable suggests, its id) somehow encoded in the invoice number?
execution plan and row estimation:
The engine will see that this must be computed for all rows. It would help to apply any other filter before this one, but you do not show enough. The line of code we see is just one part of a condition...
Indexes and statistics will surely play their roles too...
The short answer to your direct question is no. You cannot rearrange this expression to make it a sargable predicate. At least, not with a construct which is finite and agnostic of the possible values of @numberofservicesInstalled.
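If the divisor were a known constant rather than a variable, one workaround (an assumption, not something proposed in the thread; table and column names are hypothetical) would be a persisted computed column, which can be indexed and searched with a plain equality:

```sql
-- Only applicable when the divisor is fixed, e.g. 5
ALTER TABLE invoices
    ADD invoice_mod_5 AS (progressive_invoice % 5) PERSISTED;

CREATE INDEX IX_invoices_mod5 ON invoices (invoice_mod_5);

-- The predicate then becomes sargable:
-- AND F.invoice_mod_5 = @idService
```

Since @numberofservicesInstalled varies at runtime, this does not apply to the question as asked, which is why the answer above stands.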

How do I name columns that contain restricted characters so that the name is intuitive?

I have to make database columns that store the number of people who scored greater than or equal to thresholds of 0.01, 0.1, 1, and 10 at a particular metric I'm tracking.
For example, I want to store the following data
Number units quality score >= 0.01
Number units quality score >= 0.1
Number units quality score >= 1
Number units quality score >= 10
So my question is...
How do I name my columns to communicate this? I can't use the >= symbol or the . character. I was thinking something like...
gte_score_001
gte_score_01
gte_score_1
gte_score_10
Does this seem intuitive? Is there a better way to do this?
It's subjective, but I'd say score_gte_001 would be slightly more intuitive. meets_thresh_001 would be another option that may be slightly clearer than gte.
Then there's the numbers. Avoid the decimal point problem by referring to the numbers either explicitly or implicitly as hundredths:
meets_thresh_1c
meets_thresh_10c
meets_thresh_100c
meets_thresh_1000c
Your examples are all either integers or less than one, so there isn't any ambiguity; but that itself might not be obvious to someone else looking at the name, and you might have to add more confusing or even conflicting groups later. What would gte_score_15 mean - could it be read as >= 15 or as >= 1.5? You might need to represent both one day, so your naming should try to be future-proof as well as intuitive.
Including a delimiter to show where the decimal point goes would make it clearer, at least once you know the scheme. To me it makes sense to use the numeric format model character for the decimal separator, D:
gte_score_0d01
gte_score_0d1
gte_score_1
gte_score_1d5
gte_score_10
gte_score_15
though I agree with @L.ScottJohnson that score_gte_0d01 etc. scans better. Again, it's subjective.
If there is a maximum value for the metric, and a maximum precision, it might be worth including leading and trailing zeros. Say it can never be more than two digits, with no more than two decimal places:
score_gte_00d01
score_gte_00d10
score_gte_01d00
score_gte_01d50
score_gte_10d00
score_gte_15d00
The delimiter is somewhat redundant as long as you know the pattern - without it, the threshold is the numeric part/100. But it's arguably clearer with it, and with the padding it's probably even more obvious what the d represents.
If you go down this route then I'd suggest you come up with a scheme that makes sense to you, then show it to colleagues and see if they can interpret it without any hints.
You could (and arguably should) normalise your design instead, into a separate table which has a column for the threshold (which can then be a simple number) and another column for the corresponding number of people for each metric. That makes it easier to add more thresholds later (adding extra rows rather than an extra column) and makes the naming problem go away. (You could add a view to pivot into this layout, but then you're back to your naming problem.)
Do you really want to do that? What if it turns out that you need an additional threshold of 20? Or 75? Or ...? Why wouldn't you normalize it and use two columns - one that represents a threshold, and another that represents the number of people?
So, instead of
create table some_table
(<some_columns_here>,
gte_score_001 number,
gte_score_01 number,
gte_score_1 number,
gte_score_10 number
);
use
create table some_table
(<some_columns_here>,
threshold number,
number_of_people number
);
comment on column some_table.number_of_people is
'Represents number of people who scored greater than or equal to a threshold';
and store values like
insert into some_table (threshold, number_of_people) values (0.01, 13);
insert into some_table (threshold, number_of_people) values (0.1, 56);
insert into some_table (threshold, number_of_people) values (1, 7);
If you have to implement a new threshold value, no problem - just insert it as
insert into some_table (threshold, number_of_people) values (75, 24);
Doing it your way, you'd have to
alter table some_table add gte_score_75 number;
and - what's even worse - modify all other program units that reference that table (stored procedures, views, forms, reports ... the list is quite long).
Storing the number of people above a threshold will cause a problem no matter which of these solutions you use. Instead, use a view over the raw data. If you need a new split, you create a new view; you don't have to recalculate any columns. This example is not efficient, but illustrates the idea:
create or replace view thresholds as
select
(select count(*) c from rawdata where score >= .1) as ".1"
, (select count(*) c from rawdata where score >= 1 ) as "1"
, (select count(*) c from rawdata where score >= 10 ) as "10"
from dual
Double quotes around the column name alias will let you get away with a lot of things. That removes most limitations on restricted characters and keywords.
I suggest something like this ought to work for display purposes.
SELECT 314 "Number units quality score >= 0.01" FROM DUAL

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a QUARTILE wrapper, and it only partly works: it returns values that are slightly off from what I would expect if I calculated the range of values manually.
I've looked around, but most of the posts and research are about designing the formula; I haven't come across anything compelling that explains this odd behaviour.
My formula (entered with Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met and should return range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated from this range.
If I take the output from the formula, the 25th percentile (QUARTILE, 1) is 0.8803, but if I calculate it manually based on the data points right above, it comes out to 0.8685, and I can't see why.
I feel it's because the IF statements identify a slightly different range, or the values that meet the IF conditions come from different rows or something.
If you look at the table here you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must be like QUARTILE.EXC, and the one you are using in the formula is like QUARTILE.INC.
Basically, both formulas work out the rank of the quartile value. If it isn't an integer, it interpolates (e.g. if it was 1.5, the quartile lies halfway between the first and second numbers in ascending order). You might think there wouldn't be much difference, but for small samples there is a massive difference:
Quartile.exc Rank=(N+1)/4
Quartile.inc Rank=(N+3)/4
Here's how it would look with your data
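A worked check (added for clarity) with the six matching values sorted ascending - 0.867040346, 0.868997877, 0.914032128, 0.981207615, 0.984750004, 0.988983643 - shows the two rank formulas reproduce both numbers from the question:

```
QUARTILE.INC: rank = (6+3)/4 = 2.25
  -> 0.868997877 + 0.25 * (0.914032128 - 0.868997877) ≈ 0.8803
QUARTILE.EXC: rank = (6+1)/4 = 1.75
  -> 0.867040346 + 0.75 * (0.868997877 - 0.867040346) ≈ 0.8685
```

So the formula (INC) and the hand calculation (EXC) are both "right"; they just interpolate from different ranks.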

Calculate Percentile Rank using NTILE?

Need to calculate the percentile rank (1st - 99th percentile) for each student with a score for a single test.
I'm a little confused by the MSDN definition of NTILE, because it does not explicitly mention percentile rank. I need some sort of assurance that NTILE is the correct function to use for calculating percentile rank.
declare @temp table
(
StudentId int,
Score int
)
insert into @temp
select 1, 20
union
select 2, 25
.....
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
It looks correct to me, but is this the correct way to calculate percentile rank?
NTILE is absolutely NOT the same as percentile rank. NTILE simply divides a set of data evenly by the number provided (as noted by RoyiNamir above). If you chart the results of both functions, NTILE will be a perfectly linear line from 1 to n, whereas percentile rank will [usually] have some curves to it depending on your data.
Percentile rank is much more complicated than simply dividing it up by N. It then takes each row's number and figures out where in the distribution it lies, interpolating when necessary (which is very CPU intensive). I have an Excel sheet of 525,000 rows and it dominates my 8-core machine's CPU at 100% for 15-20 minutes just to figure out the PERCENTRANK function for a single column.
One way to think of this is, "the percentage of Students with Scores below this one."
Here is one way to get that type of percentile in SQL Server, using RANK():
select *
, (rank() over (order by Score) - 1.0) / (select count(*) from @temp) * 100 as PercentileRank
from @temp
Note that this will always be less than 100% unless you round up, and you will always get 0% for the lowest value(s). This does not necessarily put the median value at 50%, nor will it interpolate like some percentile calculations do.
Feel free to round or cast the whole expression (e.g. cast(... as decimal(4,2))) for good looking reports, or even replace - 1.0 with - 1e to force floating point calculation.
NTILE() isn't really what you're looking for in this case because it essentially divides the row numbers of an ordered set into groups rather than the values. It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point. You'd have to then additionally group by that value and grab the max or min percentile of the group to use NTILE() in the same way as we're doing with RANK().
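As a side note (an addition, not part of the original answer): SQL Server 2012 and later ship a built-in PERCENT_RANK() window function that computes (rank - 1) / (rows - 1) directly, so the manual RANK() arithmetic above can often be replaced. The sample scores here are made up:

```sql
declare @temp table (StudentId int, Score int)
insert into @temp values (1, 20), (2, 25), (3, 25), (4, 30)

-- PERCENT_RANK() returns 0 for the lowest score, 1 for the highest,
-- and tied scores share the same value
select StudentId, Score,
       percent_rank() over (order by Score) as PctRank
from @temp
```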
Is there a typo?
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
And your script looks good. If you think something is wrong there, could you clarify what exactly?
There is an issue with your code, as the NTILE distribution is not uniform. If you have 213 students, the topmost 13 groups would have 3 students each and the remaining 87 would have 2 students each. This is not what you would ideally want in a percentile distribution.
You might want to use RANK/ROW_NUMBER and then divide to get the percentile group.
I know this is an old thread but there's certainly a lot of misinformation about this topic making its way around the internet.
NTILE is not designed for calculating percentile rank (AKA percent rank)
If you are using NTILE to calculate percent rank you are doing it wrong. Anyone who tells you otherwise is misinformed and mistaken. If you are using NTILE(100) and getting the correct answer, it's purely coincidental.
Tim Lehner explained the problem perfectly.
"It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point."
In other words, using NTILE to calculate where students rank based on their test scores can result in two students with the exact same test scores receiving different percent rank values. Conversely, two students with different scores can receive the same percent rank.
For a more verbose explanation of why NTILE is the wrong tool for this job, as well as a profoundly better-performing alternative to PERCENT_RANK, see: Nasty Fast PERCENT_RANK.
http://www.sqlservercentral.com/articles/PERCENT_RANK/141532/
