SSAS Standard deviation calculation - sql-server

I am building a small cube and have a problem creating the calculations.
Essentially I want some aggregated values per Plugin; as an example, the standard deviation of the execution time.
Something like this:
SELECT Plugin.PluginId, AVG(Task.ExecutionTimeMs) AS Mean
, STDEVP(Task.ExecutionTimeMs) AS [Standard Deviation]
, STDEVP(Task.ExecutionTimeMs) * STDEVP(Task.ExecutionTimeMs) AS Variance -- i.e. VARP(Task.ExecutionTimeMs)
FROM Task
JOIN Plugin ON Plugin.PluginId = Task.PluginId -- assumed join key
GROUP BY Plugin.PluginId
In my Analysis project I created a calculation with the following expression:
STDEVP( [Finished Tasks], [Measures].[Execution Time Ms Sum] )
which didn't work.
I tried some other functions (MAX, AVG) but none worked as intended, so I'm obviously doing something wrong.
What is the correct way to create such measures?

Assuming you are talking about SSAS Multidimensional, the MDX StDevP function is pretty expensive to use in your case because it has to calculate at the leaf level (the fact-table row level, or Task level in your case). If you have more than a couple thousand tasks, I would recommend an optimization which performs well and gets the right number.
This idea is from this thread.
load in a simple SUM measure (x) - the sum of the fact column (aggregated in the cube)
load in a simple SUM measure of x squared (x2) - the sum of the square of the fact column (aggregated in the cube)
load in a counter called cnt - the count of the fact column (aggregated in the cube)
The MDX calculation would then be:
((x2 - ((x^2)/cnt))/cnt)^0.5
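As a minimal sketch (not the original poster's exact script): assuming the three physical measures come into the cube as [Measures].[x], [Measures].[x2] and [Measures].[cnt] (your names will differ), the calculated member could look like:
CREATE MEMBER CURRENTCUBE.[Measures].[Execution Time StDevP] AS
    IIF(
        [Measures].[cnt] = 0,
        NULL,
        (
            ( [Measures].[x2] - ( [Measures].[x] ^ 2 ) / [Measures].[cnt] )
            / [Measures].[cnt]
        ) ^ 0.5
    ),
    FORMAT_STRING = "#,##0.00",
    VISIBLE = 1;
Because x, x2 and cnt are plain SUM/COUNT measures, the cube never has to touch the leaf level at query time.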

Related

Multiple IF QUARTILEs returning wrong values

I am using a nested IF statement within a QUARTILE wrapper, and it only sort of works: it returns values that are slightly off from what I would expect if I calculate the range of values manually.
I've looked around, but most of the posts and research are about designing the formula; I haven't come across anything that explains the odd behaviour I'm observing.
My formula (Ctrl+Shift+Enter as it's an array formula): =QUARTILE(IF(((F2:$F$10=$W$4)*($Q$2:$Q$10=$W$3))*($E$2:$E$10=W$2),IF($O$2:$O$10<>"",$O$2:$O$10)),1)
The full dataset:
0.868997877*
0.99480118
0.867040346*
0.914032128*
0.988150438
0.981207615*
0.986629288
0.984750004*
0.988983643*
*The formula has 3 AND conditions that need to be met and should return this range:
0.868997877
0.867040346
0.914032128
0.981207615
0.984750004
0.988983643
The 25th percentile is then calculated from that range.
If I take the output from the formula, the 25th percentile (QUARTILE, 1) is 0.8803, but if I calculate it manually from the data points right above, it comes out to 0.8685 and I can't see why.
My feeling is that the IF statements are identifying a slightly different range, or that the values meeting the conditions come from different rows.
If you look at the table here you can see that there is more than one way of estimating a quartile (or other percentile) from a sample, and Excel has two. The one you are doing by hand must be like Quartile.exc, and the one you are using in the formula is like Quartile.inc.
Basically, both formulas work out the rank of the quartile value. If it isn't an integer, they interpolate (e.g. if it were 1.5, the quartile would lie halfway between the first and second numbers in ascending order). You might think there wouldn't be much difference, but for small samples there is a massive difference:
Quartile.exc Rank=(N+1)/4
Quartile.inc Rank=(N+3)/4
Here's how it would look with your data:
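With your six matching values sorted ascending (N = 6): 0.867040346, 0.868997877, 0.914032128, 0.981207615, 0.984750004, 0.988983643
Quartile.exc: Rank = (6+1)/4 = 1.75, so interpolate 75% of the way from the 1st to the 2nd value: 0.867040346 + 0.75*(0.868997877 - 0.867040346) = 0.8685 (approx.)
Quartile.inc: Rank = (6+3)/4 = 2.25, so interpolate 25% of the way from the 2nd to the 3rd value: 0.868997877 + 0.25*(0.914032128 - 0.868997877) = 0.8803 (approx.)
So your manual figure matches Quartile.exc, and the figure your array formula returns matches Quartile.inc.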

TSQL - Do you calculate values then sum, or sum first then calculate values?

I feel stupid asking this - there is probably a math rule I am forgetting.
I am trying to calculate a gross profit based on net sales, cost, and billbacks.
I get two different values based on how I do the calculation:
(sum(netsales) - sum(cost)) + sum(billbackdollars) as CalculateOutsideSum,
sum((netsales - cost) + BillBackDollars) as CalculateWithinSum
This is coming off of a basic transaction fact table.
In this particular example, there are about 90 records being summed, and I get the following results
CalculateOutsideSum: 234.77
CalculateWithinSum: 247.70
I assumed both results would be the same, considering it's just addition and subtraction being summed.
Which method is correct?
From a mathematical point of view, you should get exactly the same value with both formulas.
Anyway, in cases like this it's better to perform the sum after the per-row calculation.
EDIT AFTER OPENER RESPONSE:
And treat your data with ISNULL or other casts that increase data precision.
Rounding, formatting and casts that decrease data precision should be applied after the sums.
Just figured it out...
The problem was that Net Sales was NULL for 3 rows, which made the whole per-row expression NULL, so those rows were dropped from the inner sum. After adding an ISNULL, both sums come out the same.
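For the record, a minimal sketch of that fix, wrapping each nullable column in ISNULL before the per-row arithmetic (the column names are the ones from the question; the fact-table name is illustrative):
select
    (sum(isnull(netsales, 0)) - sum(isnull(cost, 0))) + sum(isnull(billbackdollars, 0)) as CalculateOutsideSum,
    sum((isnull(netsales, 0) - isnull(cost, 0)) + isnull(billbackdollars, 0)) as CalculateWithinSum
from TransactionFact   -- illustrative table name
SUM already skips NULLs in the first version; it was the second version where a NULL netsales turned the whole row expression NULL and silently dropped that row's cost and billback. With the ISNULLs in place, both columns return the same number.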

How can I define a calculation in SSAS for counting greater than zero values?

I have a measure named [Measures].[Qty Back Sale] in a cube in SQL Server Analysis Services. It is the Sum of column Qty_BackSale. Now I want to define a measure that calculates the number of rows that Qty_BackSale is greater than zero.
How can I calculate it?
The easiest way would be to add a named calculation to the fact table in the data source view. You would use an expression like case when Qty_BackSale > 0 then 1 else 0 end. Then just define a new measure based on this with the standard AggregateFunction Sum.
There is no need to store the zero, though; returning NULL instead keeps the data from becoming unnecessarily dense.
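A minimal sketch of the two variants of the named-calculation expression for the data source view (whether to include the ELSE 0 is exactly the dense-vs-sparse trade-off mentioned above):
-- dense: every fact row gets a 0 or a 1
case when Qty_BackSale > 0 then 1 else 0 end
-- sparse: rows with no back sale return NULL and simply don't contribute to the Sum
case when Qty_BackSale > 0 then 1 end
Either way, the new measure on this column uses the standard AggregateFunction Sum.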

Calculate Percentile Rank using NTILE?

Need to calculate the percentile rank (1st - 99th percentile) for each student with a score for a single test.
I'm a little confused by the msdn definition of NTILE, because it does not explicitly mention percentile rank. I need some sort of assurance that NTILE is the correct keyword to use for calculating percentile rank.
create table #temp
(
StudentId int,
Score int
)
insert into #temp
select 1, 20
union
select 2, 25
.....
select NTILE(100) OVER (order by Score) PercentileRank
from #temp
It looks correct to me, but is this the correct way to calculate percentile rank?
NTILE is absolutely NOT the same as percentile rank. NTILE simply divides up a set of data evenly by the number provided (as noted by RoyiNamir above). If you chart the results of both functions, NTILE will be a perfectly linear line from 1-to-n, whereas percentile rank will [usually] have some curves to it depending on your data.
Percentile rank is much more complicated than simply dividing it up by N. It then takes each row's number and figures out where in the distribution it lies, interpolating when necessary (which is very CPU intensive). I have an Excel sheet of 525,000 rows and it dominates my 8-core machine's CPU at 100% for 15-20 minutes just to figure out the PERCENTRANK function for a single column.
One way to think of this is, "the percentage of Students with Scores below this one."
Here is one way to get that type of percentile in SQL Server, using RANK():
select *
, (rank() over (order by Score) - 1.0) / (select count(*) from #temp) * 100 as PercentileRank
from #temp
Note that this will always be less than 100% unless you round up, and you will always get 0% for the lowest value(s). This does not necessarily put the median value at 50%, nor will it interpolate like some percentile calculations do.
Feel free to round or cast the whole expression (e.g. cast(... as decimal(4,2))) for good looking reports, or even replace - 1.0 with - 1e to force floating point calculation.
NTILE() isn't really what you're looking for in this case because it essentially divides the row numbers of an ordered set into groups rather than the values. It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point. You'd have to then additionally group by that value and grab the max or min percentile of the group to use NTILE() in the same way as we're doing with RANK().
Is there a typo?
select NTILE(100) OVER (order by Score) PercentileRank
from #temp
And your script looks good. If you think something is wrong there, could you clarify what exactly?
There is an issue with your code, as the NTILE distribution is not uniform. If you have 213 students, the first 13 groups would have 3 students each and the remaining 87 would have 2 students each. This is not what you would ideally want in a percentile distribution.
You might want to use RANK/ROW_NUMBER and then divide to get the percentile group.
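For instance, a minimal sketch of that idea against the #temp table above (each student's group is just their row number as a fraction of the total, rounded up to 1-100):
select StudentId, Score,
       ceiling(row_number() over (order by Score) * 100.0 / count(*) over ()) as PercentileGroup
from #temp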
I know this is an old thread, but there's certainly a lot of misinformation about this topic making its way around the internet.
NTILE is not designed for calculating percentile rank (AKA percent rank).
If you are using NTILE to calculate percent rank, you are doing it wrong. Anyone who tells you otherwise is misinformed and mistaken. If you are using NTILE(100) and getting the correct answer, it's purely coincidental.
Tim Lehner explained the problem perfectly.
"It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point."
In other words, using NTILE to calculate where students rank based on their test scores can result in two students with the exact same test scores receiving different percent rank values. Conversely, two students with different scores can receive the same percent rank.
For a more verbose explanation of why NTILE is the wrong tool for this job, as well as a profoundly better-performing alternative to percent_rank, see: Nasty Fast PERCENT_RANK.
http://www.sqlservercentral.com/articles/PERCENT_RANK/141532/
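For comparison, SQL Server 2012 and later also ship a built-in PERCENT_RANK window function (the standard function, not the optimised approach from that article); against the #temp table above it would be:
select StudentId, Score,
       percent_rank() over (order by Score) * 100 as PercentileRank
from #temp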

Efficient comparison of 1 million vectors containing (float, integer) tuples

I am working on a chemistry/biology project. We are building a web application for fast matching of a user's experimental data against predicted data in a reference database. The reference database will contain up to a million entries. The data for one entry is a list (vector) of tuples containing a float value between 0.0 and 20.0 and an integer value between 1 and 18, for instance (7.2394, 2), (7.4011, 1), (9.9367, 3), etc.
The user will enter a similar list of tuples and the web-app must then return the - let's say - top 50 best matching database entries.
One thing is crucial: the search algorithm must allow for discrepancies between the query data and the reference data because both can contain small errors in the float values (NOT in the integer values). (The query data can contain errors because it is derived from a real-life experiment and the reference data because it is the result of a prediction.)
Edit: moved some of this text to an answer below.
How can we get an efficient ranking of 1 query on 1 million records?
You should add a physicist to the project :-) This is a very common problem when comparing functions; for example, look here:
http://en.wikipedia.org/wiki/Autocorrelation
http://en.wikipedia.org/wiki/Correlation_function
In the first link you can read: "The SEQUEST algorithm for analyzing mass spectra makes use of autocorrelation in conjunction with cross-correlation to score the similarity of an observed spectrum to an idealized spectrum representing a peptide."
An efficient linear scan of 1 million records of that type should take a fraction of a second on a modern machine; a compiled loop should be able to do it at about memory bandwidth, which would transfer that in two or three milliseconds.
But if you really need to optimise this, you could construct a hash table keyed on the integer values, which would divide the job by the number of integer bins. And if the data is stored sorted by the floats, that improves the locality of matching; you know you can stop once you're out of tolerance. Storing the offsets of each of a number of bins would give you a position to start (sketched after the code below).
I guess I don't see the need for a fancy algorithm yet... describe the problem a bit more, perhaps (you can assume a fairly high level of chemistry and physics knowledge if you like; I'm a physicist by training)?
Ok, given the extra info, I still see no need for anything better than a direct linear search, if there are only 1 million reference vectors and the algorithm is that simple. I just tried it, and even a pure Python implementation of a linear scan took only around three seconds. It took several times longer to make up some random data to test with. This does somewhat depend on the rather lunatic level of optimisation in Python's sorting library, but that's the advantage of high-level languages.
import random
# make up 1,000,000 random reference tuples to test with
r = [(random.uniform(0, 20), random.randint(1, 18)) for i in range(1000000)]
# this is a decorate-sort-undecorate pattern
# look for matches to (7, 9)
# obviously, you can use whatever distance expression you want
zz = [(abs((7 - x) + (9 - y)), x, y) for x, y in r]
zz.sort()
# return the 50 best matches
[(x, y) for a, x, y in zz[:50]]
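And a rough sketch of the bucketing optimisation mentioned a few paragraphs up: group the reference tuples by their integer value, keep each bucket sorted by the float, and use bisect so a query only touches values within tolerance. The bucket layout, tolerance and names are illustrative, not a tuned implementation.
from bisect import bisect_left
from collections import defaultdict
import random

reference = [(random.uniform(0, 20), random.randint(1, 18)) for _ in range(1000000)]

# index: one sorted list of floats per integer value
buckets = defaultdict(list)
for value, kind in reference:
    buckets[kind].append(value)
for kind in buckets:
    buckets[kind].sort()

def matches(query_value, query_kind, tolerance=0.05):
    # all reference floats with the same integer value within +/- tolerance
    bucket = buckets.get(query_kind, [])
    lo = bisect_left(bucket, query_value - tolerance)
    hi = bisect_left(bucket, query_value + tolerance)
    return bucket[lo:hi]

print(len(matches(7.0, 9)))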
Can't you sort the tuples and perform a binary search on the sorted array?
I assume your database is built once and for all, and the positions of the entries are not important. You can sort this array so that the tuples are in a given order. When a tuple is entered by the user, you just look at the middle of the sorted array. If the query value is larger than the centre value, you repeat the work on the upper half, otherwise on the lower one.
Worst case is O(log n).
If you can "map" your reference data to x-y coordinates on a plane there is a nifty technique which allows you to select all points under a given distance/tolerance (using Hilbert curves).
Here is a detailed example.
One approach we are trying ourselves, which allows for the discrepancies between query and reference, is binning the float values. We are testing this and want to offer the user the choice of different bin sizes: 0.1, 0.2, 0.3 or 0.4. Binning then leaves us with between 50 and 200 bins, each with a corresponding integer value between 0 and 18, where 0 means there was no value within that bin. The reference data can be pre-binned and stored in the database. We can then take the binned query data and compare it with the reference data. One approach could be: for each bin, subtract the query integer value from the reference integer value. By summing up all the differences we get a similarity score, with the most similar reference entries resulting in the lowest scores (a rough sketch is at the end of this answer).
Another (simpler) search option we want to offer is where the user only enters the float values. The integer values in both the query and the reference list can then be set to 1. We then use the Hamming distance to compute the difference between the query and the reference binned values. I have previously asked about an efficient algorithm for that search.
This binning is only one way of achieving our goal. I am open to other suggestions. Perhaps we can use Principal Component Analysis (PCA), as described here
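A rough sketch of the binning-and-scoring approach described above (the bin size and names are illustrative, and I've used the absolute per-bin difference for the score, so lower means more similar):
import math

BIN_SIZE = 0.1      # user-selectable: 0.1, 0.2, 0.3 or 0.4
MAX_VALUE = 20.0

def to_bins(entry, bin_size=BIN_SIZE, max_value=MAX_VALUE):
    # entry is a list of (float, int) tuples; 0 means nothing fell into that bin
    n_bins = int(math.ceil(max_value / bin_size))
    bins = [0] * n_bins
    for value, kind in entry:
        bins[min(int(value // bin_size), n_bins - 1)] = kind
    return bins

def score(query_bins, reference_bins):
    # sum of absolute per-bin differences: the most similar entry has the lowest score
    return sum(abs(q - r) for q, r in zip(query_bins, reference_bins))

# pre-bin the reference entries once, then rank them against a binned query
query = to_bins([(7.2394, 2), (7.4011, 1), (9.9367, 3)])
reference_db = {"entry_a": to_bins([(7.25, 2), (7.40, 1), (9.90, 3)]),
                "entry_b": to_bins([(1.00, 5), (2.00, 6)])}
best = sorted(reference_db, key=lambda name: score(query, reference_db[name]))[:50]
print(best)  # entry_a scores lower (more similar) than entry_b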
