I have a requirement to calculate the Moving Range of a load of data (at least I think this is what it is called) in SQL Server. This would be easy if I could use arrays, but I understand this is not possible in MS SQL, so I wonder if anyone has a suggestion.
To give you an idea of what I need:
Lets say I have the following in a sql server table:
1
3
2
6
3
I need to get the absolute difference between each consecutive pair of numbers (in order), i.e.:
|3-1|=2
|2-3|=1
|6-2|=4
|3-6|=3
Then square these:
2^2=4
1^2=1
4^2=16
3^2=9
EDIT: PROBABLY WORTH NOTING THAT YOU DO NOT SQUARE THESE FOR MOVING AVERAGE - I WAS WRONG
Then sum them:
4+1+16+9=30
Then divide by number of values:
30/5=6
Then square root this:
2.5(ish)
EDIT: BECAUSE YOU AREN'T SQUARING THEM, YOU AREN'T SQUARE-ROOTING THEM EITHER
If anyone can just help me out with the first step, that would be great - I can do the rest myself.
A few other things to take into account:
- Using stored procedure in SQL Server
- There is quite a lot of data (100s or 1000s of values), and they will need to be calculated daily or weekly
Many thanks in advance.
~Bob
WITH nums AS
(
SELECT num, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM mytable
)
SELECT SQRT(AVG(1.0 * POWER(tp.num - tf.num, 2))) -- 1.0 * forces a non-integer average
FROM nums tp
JOIN nums tf
ON tf.rn = tp.rn + 1
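Given the edit above (for the moving range the differences are not squared, and therefore not square-rooted), a variant of the same approach without POWER and SQRT might look like the sketch below. It assumes the same mytable with an id ordering column and a num value column; note that AVG divides by the number of differences, which is one fewer than the number of values:
WITH nums AS
(
SELECT num, ROW_NUMBER() OVER (ORDER BY id) AS rn
FROM mytable
)
-- Average of the absolute consecutive differences (the moving ranges); * 1.0 forces a non-integer average.
SELECT AVG(ABS(tp.num - tf.num) * 1.0) AS AvgMovingRange
FROM nums tp
JOIN nums tf
ON tf.rn = tp.rn + 1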
In my SQL Server 2017 Standard CU20 installation there is a very frequently executed query that has this awful condition:
AND F.progressive_invoice % @numberofservicesInstalled = @idService
Is there a mathematical way to rewrite it in a form that is more convenient for SQL Server?
The query takes from 240 to 500 ms.
Can you help me do better, please?
What do you think is particularly awful here?
Is this query performing badly? Are you sure that this condition is responsible?
This % is the modulo operator.
SELECT 13 % 5 => Remainder is 3
--This is roughly the same as what your code is doing:
DECLARE @Divisor INT = 5; --Switch this value
DECLARE @CompareRemainder INT = 3;
SELECT CASE WHEN 13 % @Divisor = @CompareRemainder THEN 'Remainder matches variable' ELSE 'no match' END;
Your line of code tells the engine to compute an integer division of F.progressive_invoice by the variable @numberofservicesInstalled and take the remainder. The result of this computation is then compared to the variable @idService.
As this computation must be done for each value, an index will not help here...
I do not think this can be made any more sargable.
UPDATE
In a comment you suggest that it might help to change the code on the other side of the equals operator. No, this will not help.
I tried to think of a meaningful interpretation of this... Is the number of a service (or, as the variable suggests, its id) somehow encoded in the invoice number?
execution plan and row estimation:
The engine will see that this must be computed for all rows. It would help to apply any other filters before this one has to be evaluated. But you do not show enough of the query; the line of code we see is just one part of a condition...
Indexes and statistics will surely play their roles too...
The short answer to your direct question is no. You cannot rearrange this expression to make it a sargable predicate. At least, not with a construct which is finite and agnostic of the possible values of @numberofservicesInstalled.
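Purely as an illustration of that caveat: if the set of possible @numberofservicesInstalled values were small, fixed and known in advance (which the answer above deliberately does not assume), the remainders could be precomputed as persisted computed columns and indexed. The table name dbo.Invoices and the divisor values below are hypothetical.
-- Hypothetical: assumes the divisor can only ever be 3 or 4.
ALTER TABLE dbo.Invoices
    ADD progressive_invoice_mod3 AS (progressive_invoice % 3) PERSISTED,
        progressive_invoice_mod4 AS (progressive_invoice % 4) PERSISTED;

CREATE INDEX IX_Invoices_mod3 ON dbo.Invoices (progressive_invoice_mod3);
CREATE INDEX IX_Invoices_mod4 ON dbo.Invoices (progressive_invoice_mod4);

-- The query (or dynamic SQL) can then target the matching column:
-- ... AND F.progressive_invoice_mod3 = @idService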
I'm running queries against a baseball database and was wondering if it's possible to write a query that returns the nearest neighbors (top 20 to 50): baseball players whose statistics and demographics are closest to what is included in the WHERE clause. For example,
Select Top 20 Player_ID, Player_FullName
From BaseballDB
Where Age = 23 And BattingAvg = 250 And OPS = 100
I've used equal signs in my query, although for what I'm trying to achieve the values don't actually have to be equal; I'm just looking for players that fall close to an intersection of the dimensions included in my WHERE clause.
I am familiar with nearest neighbor analyses in predictive analytics, but I am just curious whether it's possible to achieve something similar with SQL.
Yes, but you need to define what the distance metric is. Nearest neighbor is not one particular method; it depends on the definition of the metric.
For instance, one metric is Manhattan Distance. This would be implemented as:
select top (25) b.*
from baseballDB b
order by abs(age - 23) + abs(battingavg - 250) + abs(ops - 100);
If you square the values instead of using abs(), you have the familiar Euclidean metric (the square root is not needed for ordering purposes).
For various reasons, Manhattan Distance is probably not a suitable metric for this data (the different columns have different ranges). But this shows how to implement the nearest neighbor in a database.
I should point out that databases are not generally optimized for this type of query, so it requires sorting all the data. There are ways to optimize nearest-neighbor search, but those optimizations are generally not available in databases for bespoke metrics.
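To make the Euclidean variant mentioned above explicit, here is a sketch against the same columns; the square root is omitted because it does not change the ordering:
select top (25) b.*
from baseballDB b
order by power(b.age - 23, 2)
       + power(b.battingavg - 250, 2)
       + power(b.ops - 100, 2);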
I think I would do something like:
- raise it to a power, so a big difference in any one makes a big difference
- divide to normalize
declare @Age int = 23, @Bat int = 250, @OPS int = 10, @pw float = 2;
select Top 20 Player_ID, Player_FullName
from BaseballDB
order by (@Age + power(abs(@Age - Age), @pw)) / @Age
+ (@Bat + power(abs(@Bat - BattingAvg), @pw)) / @Bat
+ (@OPS + power(abs(@OPS - OPS), @pw)) / @OPS -- ascending: the smallest combined, normalized distance comes first (nearest)
I am building a little cube and have a problem with creating the calculations.
All in all I want some values based on the Plugin.
As an example I want the standard deviation of the execution time.
Something like this:
SELECT Plugin.PluginId, AVG(Task.ExecutionTimeMs) AS Mean
, STDEVP(Task.ExecutionTimeMs) AS [Standard Deviation]
, STDEVP(Task.ExecutionTimeMs) * STDEVP(Task.ExecutionTimeMs) AS Variance
FROM Task
JOIN Plugin ON Plugin.PluginId = Task.PluginId -- assumed join key
GROUP BY Plugin.PluginId
In my Analysis Services project I created a calculation with the following expression:
STDEVP( [Finished Tasks], [Measures].[Execution Time Ms Sum] )
which didn't work.
I tried some other functions (MAX,AVG) but none worked as intended so I'm obviously doing something wrong.
What is the correct way to create such measures?
Assuming you are talking about SSAS Multidimensional, the StDevP MDX function is pretty expensive to use in your case because it has to calculate at the leaf level (the fact table row level, or Task level in your case). If you have more than a couple of thousand tasks, I would recommend an optimization which performs well and gets the right number.
This idea is from this thread.
- load in a simple SUM measure (x) - the sum of the fact column (aggregated in the cube)
- load in a simple SUM measure of x squared (x2) - the sum of the square of the fact column (aggregated in the cube)
- load in a counter called cnt - the count of a fact column (aggregated in the cube)
[The MDX calculation would then be:]
((x2 - ((x^2)/cnt))/cnt)^0.5
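As a sanity check outside the cube, the same sum / sum-of-squares / count identity can be verified in T-SQL against the built-in STDEVP; the table and column names are simply the ones from the question's sketch:
-- Population standard deviation built from sum, sum of squares and count,
-- compared against the built-in STDEVP aggregate.
SELECT STDEVP(t.ExecutionTimeMs) AS BuiltIn
     , SQRT((SUM(SQUARE(t.ExecutionTimeMs * 1.0))
             - SQUARE(SUM(t.ExecutionTimeMs * 1.0)) / COUNT(*))
            / COUNT(*)) AS FromSums
FROM Task AS t;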
For my master's thesis I have to use SPSS to analyse my data. I thought I wouldn't have to deal with very difficult statistical issues, which is still true regarding the concepts of my analysis. BUT the problem is that in order to create my dependent variable I need to use the syntax editor/programming in general, and I have no experience in this area at all. I hope you can help me with the process of creating my syntax.
I have in total approximately 900 companies with 6 year observations. For all of these companies I need the predicted values of the following company-specific regression:
Y = ß1*X1 + ß2*X2 + ß3*X3 + error
(I know the ß's will very likely not be significant, but this is nothing to worry about for my thesis; it will be mentioned in the limitations, though.)
So far my data are ordered in the following way
COMPANY YEAR X1 X2 X3
1 2002
2 2002
1 2003
2 2003
But I could easily change the order, e.g. in
1
1
2
2 etc.
Ok let's say I have rearranged the data: what I need now is that SPSS computes for each company the specific ß and returns the output in one column (the predicted values with those ß multiplied by the specific X in each row). So I guess what I need is a loop that does a multiple linear regression for 6 rows for each of the 939 companies, am I right?
As I said I have no experience at all, so every hint is valuable for me.
Thank you in advance,
Janina.
Bear in mind that with only six observations per company and three (or 4 if you also have a constant term) coefficients to estimate, the coefficient estimates are likely to be very imprecise. You might want to consider whether companies can be pooled at least in part.
You can use SPLIT FILE to estimate separate regressions for each company; see the example below. Note that one would likely want to consider other panel data models and assess whether there is autocorrelation in the residuals. (IMO this is still a useful approach for exploratory analysis of multi-level models.)
The example declares a new dataset to pipe the regression estimates to (see the OUTFILE subcommand on REGRESSION) and suppresses the other tables (with 900+ tables much of the time is spent rendering the output). If you need other statistics either omit the OMS that suppresses the tables, or tweak it to only show the tables you want. (You can use OMS to pipe other results to other datasets as well.)
************************************************************.
*Making Fake data.
SET SEED 10.
INPUT PROGRAM.
LOOP #Comp = 1 to 1000.
COMPUTE #R1 = RV.NORMAL(10,2).
COMPUTE #R2 = RV.NORMAL(-3,1).
COMPUTE #R3 = RV.NORMAL(0,5).
LOOP Year = 2003 to 2008.
COMPUTE Company = #Comp.
COMPUTE Rand1 = #R1.
COMPUTE Rand2 = #R2.
COMPUTE Rand3 = #R3.
END CASE.
END LOOP.
END LOOP.
END FILE.
END INPUT PROGRAM.
DATASET NAME Companies.
COMPUTE x1 = RV.NORMAL(0,1).
COMPUTE x2 = RV.NORMAL(0,1).
COMPUTE x3 = RV.NORMAL(0,1).
COMPUTE y = Rand1*x1 + Rand2*x2 + Rand3*x3 + RV.NORMAL(0,1).
FORMATS Company Year (F4.0).
*Now sorting cases by Company and Year, then using SPLIT file to estimate
*the regression.
SORT CASES BY Company Year.
*Declare new set and have OMS suppress the other results.
DATASET DECLARE CoeffTable.
OMS
/SELECT TABLES
/IF COMMANDS = 'Regression'
/DESTINATION VIEWER = NO.
*Now split file to get the coefficients.
SPLIT FILE BY Company.
REGRESSION
/DEPENDENT y
/METHOD=ENTER x1 x2 x3
/SAVE PRED (CompSpePred)
/OUTFILE = COVB ('CoeffTable').
SPLIT FILE OFF.
OMSEND.
************************************************************.
Need to calculate the percentile rank (1st - 99th percentile) for each student with a score for a single test.
I'm a little confused by the MSDN definition of NTILE, because it does not explicitly mention percentile rank. I need some sort of assurance that NTILE is the correct function to use for calculating percentile rank.
declare @temp table
(
StudentId int,
Score int
)
insert into @temp
select 1, 20
union
select 2, 25
.....
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
It looks correct to me, but is this the correct way to calculate percentile rank?
NTILE is absolutely NOT the same as percentile rank. NTILE simply divides up a set of data evenly by the number provided (as noted by RoyiNamir above). If you chart the results of both functions, NTILE will be a perfectly linear line from 1-to-n, whereas percentile rank will [usually] have some curves to it depending on your data.
Percentile rank is much more complicated than simply dividing it up by N. It then takes each row's number and figures out where in the distribution it lies, interpolating when necessary (which is very CPU intensive). I have an Excel sheet of 525,000 rows and it dominates my 8-core machine's CPU at 100% for 15-20 minutes just to figure out the PERCENTRANK function for a single column.
One way to think of this is, "the percentage of Students with Scores below this one."
Here is one way to get that type of percentile in SQL Server, using RANK():
select *
, (rank() over (order by Score) - 1.0) / (select count(*) from @temp) * 100 as PercentileRank
from @temp
Note that this will always be less than 100% unless you round up, and you will always get 0% for the lowest value(s). This does not necessarily put the median value at 50%, nor will it interpolate like some percentile calculations do.
Feel free to round or cast the whole expression (e.g. cast(... as decimal(4,2))) for good looking reports, or even replace - 1.0 with - 1e to force floating point calculation.
NTILE() isn't really what you're looking for in this case because it essentially divides the row numbers of an ordered set into groups rather than the values. It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point. You'd have to then additionally group by that value and grab the max or min percentile of the group to use NTILE() in the same way as we're doing with RANK().
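A quick way to see the difference is to run both expressions over a tiny set with tied scores. This is only a sketch; the @demo table variable and its values are made up, and the tile count is shrunk to 4 so the crossover is easy to see:
declare @demo table (StudentId int, Score int)

insert into @demo
values (1, 20), (2, 25), (3, 25), (4, 30), (5, 35), (6, 40)

select StudentId, Score
     , NTILE(4) OVER (order by Score) as NtileBucket -- students 2 and 3 share a score but land in different buckets
     , (rank() over (order by Score) - 1.0) / (select count(*) from @demo) * 100 as PercentileRank -- tied scores get the same value
from @demo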
Is there a typo?
select NTILE(100) OVER (order by Score) PercentileRank
from @temp
And your script looks good. If you think something is wrong there, could you clarify what exactly?
There is an issue with your code, as the NTILE distribution is not uniform. If you have 213 students, the first 13 groups would have 3 students each and the remaining 87 groups would have 2 students each. This is not what you would ideally want in a percentile distribution.
You might want to use RANK/ROW_NUMBER and then divide to get the percentile group.
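A minimal sketch of that suggestion, reusing the @temp table variable from the question (CEILING keeps tied scores in the same group and yields groups 1 through 100):
select StudentId, Score
     , CEILING(rank() over (order by Score) * 100.0 / (select count(*) from @temp)) as PercentileGroup
from @temp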
I know this is an old thread, but there's certainly a lot of misinformation about this topic making its way around the internet.
NTILE is not designed for calculating percentile rank (AKA percent rank)
If you are using NTILE to calculate Percent Rank you are doing it wrong. Anyone who tells you otherwise is misinformed and mistaken. If you are using NTILE(100) and getting the correct answer, it's purely coincidental.
Tim Lehner explained the problem perfectly.
"It will assign a different percentile to two instances of the same value if those instances happen to straddle a crossover point."
In other words, using NTILE to calculate where students rank based on their test scores can result in two students with the exact same test scores receiving different percent rank values. Conversely, two students with different scores can receive the same percent rank.
For a more verbose explanation of why NTILE is the wrong tool for this job, as well as a profoundly better-performing alternative to PERCENT_RANK, see: Nasty Fast PERCENT_RANK.
http://www.sqlservercentral.com/articles/PERCENT_RANK/141532/