T-SQL: aggregate function for calculating Nth percentile

I am trying to calculate the Nth percentile of all of the values in a single column in a table. All I want is a scalar, aggregate value below which N percent of the values fall. For instance, if the table has 100 rows where the value is the same as the row index plus one (1 to 100 consecutively), then I'd want this value to tell me that 95% of the values are below 95.
The PERCENTILE_CONT analytic function looks closest to what I want. But if I try to use it like this:
SELECT PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM ExampleTable
I get one row per row in the table, all with the same value. I could use TOP 1 to return just one of those rows, but then I've done an additional table scan.
I am not trying to create a whiz-bang table of results partitioned by some other column in the original table. I just want an aggregate, scalar value.
Edit: I have been able to use PERCENTILE_CONT in a query with a WHERE clause. For example:
DECLARE @P95 INT
SELECT TOP 1 @P95 = (PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER ())
FROM ExampleTable
WHERE LOWER(Color) = 'blue'
SELECT @P95
Including the WHERE clause gives a different result than I got without it.

From what I can tell, you will need to use a subquery or CTE here. For example, to find the number of records strictly below the 95th percentile we can try:
WITH cte AS (
SELECT ValueColumn,
PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY ValueColumn) OVER () AS P95
FROM yourTable
)
SELECT COUNT(*)
FROM cte
WHERE ValueColumn < P95;
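For reference, the interpolation behind PERCENTILE_CONT can be modelled outside the database. The following is a minimal Python sketch, not SQL Server's implementation: the function name percentile_cont and the 1-to-100 sample data are assumptions for illustration, and it assumes the usual linear-interpolation definition, where the percentile sits at position p * (n - 1) in the zero-based sorted list.

```python
import math


def percentile_cont(values, p):
    """Linearly interpolated percentile, modelled on PERCENTILE_CONT (hypothetical helper)."""
    ordered = sorted(values)
    if not ordered:
        return None
    # Zero-based fractional position of the requested percentile.
    pos = p * (len(ordered) - 1)
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    # Interpolate between the two neighbouring values.
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Sample data from the question: the values 1 to 100.
values = list(range(1, 101))
p95 = percentile_cont(values, 0.95)

# Counterpart of the COUNT(*) ... WHERE ValueColumn < P95 query.
below = sum(1 for v in values if v < p95)
print(p95, below)
```

For the 1-to-100 example this yields a single scalar of roughly 95.05, with 95 values strictly below it.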

Related

Row number for same value

The result of my SQL Server query returns 3 columns.
Select Id, InItemId, Qty
from Mytable
order by InItemId
I need to add a column, call it row, that starts from 1 and increases by 1 within each group of rows that share the same InItemId value.
So the result should be:
Thank you !
Use row_number():
select row_number() over (partition by initemid order by initemid) as row,
t.*
from t;
Note: There is no ordering within a given value of initemid. SQL tables represent unordered sets and there is no obvious column to use for ordering.
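To see the per-group numbering in action, here is a small sketch that runs the same window function through Python's built-in sqlite3 module as a stand-in for SQL Server (it needs SQLite 3.25 or later for window functions). The sample rows and the RowNum alias are made up for illustration, and ordering by Id inside each partition is an assumption added only to make the numbering deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Mytable (Id INTEGER, InItemId INTEGER, Qty INTEGER)")

# Hypothetical sample rows: (Id, InItemId, Qty).
conn.executemany(
    "INSERT INTO Mytable VALUES (?, ?, ?)",
    [(1, 10, 5), (2, 10, 3), (3, 20, 7), (4, 10, 1), (5, 20, 2)],
)

# Numbering restarts at 1 for every distinct InItemId.
rows = conn.execute(
    """
    SELECT Id, InItemId, Qty,
           ROW_NUMBER() OVER (PARTITION BY InItemId ORDER BY Id) AS RowNum
    FROM Mytable
    ORDER BY InItemId, Id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```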

Using Top in T-SQL

A question on using Top. For example, we have this SQL statement:
SELECT TOP (5) WITH TIES orderid, orderdate, custid, empid
FROM Sales.Orders
ORDER BY orderdate DESC;
It orders the returned rows by orderdate first, then selects the top five rows.
But doesn't the ORDER BY clause happen after the SELECT clause, which would mean that five arbitrary rows are returned first and then those five rows are ordered by orderdate?
The order of commands in the statement doesn't reflect the actual order of operations that SQL follows. See this article which shows the order to be:
from
where
group by
having
select
order by
limit
As you can see, the TOP operation (limit) is the last to be executed.
The question already has an accepted answer, but I would like to quote the Microsoft documentation.
Logical Processing Order of the SELECT statement
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
"But doesn't the ORDER BY clause happen after the SELECT clause, which would mean that five arbitrary rows are returned first and then those five rows are ordered by orderdate?"
No. ORDER BY is processed after the SELECT, but limiting the result set to 5 rows happens even later.
The physical details of actual query processing may vary, but the end result is as if the server sorted the whole table by orderdate, picked the top 5 rows (or more if needed due to ties), returned those rows and discarded the rest.
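The "sort everything, then cut, then extend for ties" behaviour described here can be sketched in a few lines of Python. This is only a model of the logical result, not of how SQL Server executes the query; the helper name top_with_ties and the sample orders are hypothetical.

```python
def top_with_ties(rows, n, key, reverse=False):
    """Model of TOP (n) WITH TIES ... ORDER BY key (hypothetical helper)."""
    # Logical step 1: order the whole input.
    ordered = sorted(rows, key=key, reverse=reverse)
    if n <= 0:
        return []
    if len(ordered) <= n:
        return ordered
    # Logical step 2: keep the first n rows.
    result = ordered[:n]
    cutoff = key(ordered[n - 1])
    # Logical step 3: also keep rows that tie with the last kept row.
    for row in ordered[n:]:
        if key(row) != cutoff:
            break
        result.append(row)
    return result


# Hypothetical (orderid, orderdate) pairs.
orders = [
    (1, "2024-05-01"), (2, "2024-05-03"), (3, "2024-05-02"),
    (4, "2024-05-03"), (5, "2024-05-02"), (6, "2024-05-02"),
    (7, "2024-04-30"),
]

# Counterpart of SELECT TOP (3) WITH TIES ... ORDER BY orderdate DESC.
top = top_with_ties(orders, 3, key=lambda o: o[1], reverse=True)
print(top)
```

With this data, three rows are requested but five come back, because orders 3, 5 and 6 all share the date of the third row.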

Retrieving X rows from an ordered CTE, TOP vs Range

Objective:
I want to know which performs better when retrieving a fixed number of rows from a CTE that is already ordered.
Example:
Say I have a CTE (intentionally simplified) that looks like this, and I only want the top 5 rows:
WITH cte
AS (
SELECT Id = RANK() OVER (ORDER BY t.ActionID asc)
, t.Name
FROM tblSample AS t -- tblSample is indexed on Id
)
Which is faster:
SELECT TOP 5 * FROM cte
OR
SELECT * FROM cte WHERE Id BETWEEN 1 AND 5 ?
Notes:
I am not a DB programmer, so to me the TOP solution seems better: once SQL Server finds the 5th row, it will stop executing and "return" (100% assumption), while with the other method I feel it will unnecessarily process the whole CTE.
My question is for a CTE, would the answer to this question be the same if it were a table?
The most important thing to note is that the two queries will not always produce the same result set. Consider the following data:
CREATE TABLE #tblSample (ActionId int not null, name varchar(10) not null);
INSERT #tblSample VALUES (1,'aaa'),(2,'bbb'),(3,'ccc');
Both of these will produce the same result:
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT TOP(2) * FROM CTE;
WITH CTE AS
(
SELECT id = RANK() OVER (ORDER BY t.ActionID asc), t.name
FROM #tblSample t
)
SELECT * FROM CTE WHERE id BETWEEN 1 AND 2;
Now let's do this update:
UPDATE #tblSample SET ActionId = 1;
After this update the first query still returns two rows, while the second query returns three, because all three rows now tie at rank 1. Keep in mind too that, without an ORDER BY in the TOP query, the results are not guaranteed because there is no default order in SQL.
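The row counts before and after the update can be reproduced with Python's built-in sqlite3 module standing in for SQL Server (SQLite 3.25 or later is needed for RANK()). This is a sketch under assumptions: SQLite has no TOP, so ORDER BY id LIMIT 2 is used in its place, a regular table replaces the temp table, and the helper row_counts is hypothetical.

```python
import sqlite3

# Same CTE as in the answer, written in SQLite-compatible syntax.
CTE = """
WITH cte AS (
    SELECT RANK() OVER (ORDER BY ActionId ASC) AS id, name
    FROM tblSample
)
"""


def row_counts(conn):
    """Return (rows from the TOP-style query, rows from the BETWEEN query)."""
    top_rows = conn.execute(CTE + "SELECT * FROM cte ORDER BY id LIMIT 2").fetchall()
    between_rows = conn.execute(
        CTE + "SELECT * FROM cte WHERE id BETWEEN 1 AND 2"
    ).fetchall()
    return len(top_rows), len(between_rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblSample (ActionId INTEGER NOT NULL, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO tblSample VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "ccc")],
)

before = row_counts(conn)  # distinct ActionId values: ranks 1, 2, 3

conn.execute("UPDATE tblSample SET ActionId = 1")
after = row_counts(conn)   # every row now has rank 1

print(before, after)
conn.close()
```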
With that out of the way - which performs better? It depends. It depends on your indexing, your statistics, number of rows, and the execution plan that the SQL Engine goes with.
TOP 5 returns whichever 5 rows come first from the index the engine chooses to read, whereas Id BETWEEN 1 AND 5 fetches rows by the Id column, using an index seek or a scan depending on the available indexes and the selected columns. These are two different queries. The BETWEEN query may be slow if there is no index on Id.
Let me try to explain with an example.
Assume your data is in a table named yourcte.
create index nci_name on yourcte(id) include(name)
--drop index nci_name on yourcte
;with cte as (
select * from yourcte )
select top 5 * from cte
;with cte as (
select * from yourcte )
select * from cte where id between 1 and 5
First I create an index on id with name included. With that index in place, the second query does an index seek, while the first does an index scan and then takes the top 5, so in this case the second approach is better.
Now I remove the index by executing:
drop index nci_name on yourcte
Both approaches now do a table scan. In the execution plan properties, the scan for the first approach reads only 5 rows, while the scan for the second approach reads 10 rows and then applies the predicate, so now the first approach is better.
In your case this index needs to be on ActionId, which determines the id.
Hence performance depends on how the base table is indexed.
In order to get the RANK() which you are calculating in your cte it must sort all the data by t.ActionID. Sorting is a blocking operation: the entire input must be processed before a single row is output.
So in this case whether you select any five rows, or if you take the five that sorted to the top of the pile is probably irrelevant.

Retrieving specific number of rows based on sum of row number

After reading and experimenting, I decided I need to ask:
I am trying to retrieve a specific number of rows from a table based on the sum of the row number: This is a basic table with two columns: CusID, CusName.
I started by numbering each row to 1 so that I can use a SUM of the row number, or so I thought.
WITH Example AS
(
SELECT
*,
ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM
MySchema.MyTable
)
I am not sure how to move beyond here. I tried using the HAVING clause but obviously that would not work. I could also use TOP or Percent.
But I would like to retrieve the rows based on the sum of row number.
What's the way to do this?
First of all, windowed functions cannot be used in the context of another windowed function or aggregate, so you cannot combine an aggregate with ROW_NUMBER() at the same query level. I think it is better to apply the aggregate after your WITH clause, like this:
WITH Example AS
(
SELECT *, ROW_NUMBER() OVER (Partition by CusID ORDER BY CusID) AS RowNumber
FROM MySchema.MyTable
)
select cusid,cusname,sum(rownumber) from example
group by Cusid,Cusname
having .....

Calculating mean from column values

I want a result-set as below from a table:
I tried query:
select sdate, sum(PG)/sum(PT)*100 AS Score, avg(score) as Mean from table
but I am not getting the correct Mean.
Mean is: sum of all scores / total number of scores.
I want to show mean as computed column. In the above result-set, total of scores is 309 and when divided by 4 (total number of rows) it gives 77.25.
I want to display the result as shown in the result-set.
I believe you're looking for something like:
SELECT sdate,
SUM(PG)/SUM(PT)*100 AS Score,
(SELECT AVG(score) FROM table) AS Mean
FROM table
This should set the mean to the average score across the entire table. If you have WHERE clauses for filtering, you would have to place them in both the subquery and the main query.
EDIT
If the original SQL statement has a GROUP BY, as it sounds like it does, then you could use the following query to achieve what you're looking for:
SELECT sdate,
SUM(PG)/SUM(PT)*100 AS score,
(SELECT AVG(score)
FROM (SELECT CAST(SUM(PG)/SUM(PT)*100 AS FLOAT) AS score
FROM table
GROUP BY sdate) scores) AS Mean
FROM table
GROUP BY sdate
It's not pretty, but I believe it'll accomplish what you're looking for.
DECLARE @mean DECIMAL(5,2);
SELECT @mean = AVG(score) FROM dbo.table;
SELECT sdate, score, Mean = @mean FROM dbo.table;
I think this will work for you.
Since you need to show the average in each row, I am using a subquery to generate that average for every row.
create table mean
(
d date,
score int
)
insert into mean values ('01/01/2013',50),('02/01/2013',60)
,('03/01/2013',40),('04/01/2013',30)
,('05/01/2013',20),('06/01/2013',20)
SELECT d,score,(select sum(score)/count(*) from mean) from mean
But it would be better to compute the average first and use that variable in your SELECT statement.
SELECT d,score,
ISNULL((select CONVERT(DECIMAL(10,2),CONVERT(DECIMAL(10,2),sum(score))/count(*) ) from #mean) ,0) as mean from #mean
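Another way to repeat the overall mean on every row is a window aggregate with an empty OVER () clause. The sketch below runs it through Python's built-in sqlite3 module as a stand-in for SQL Server (SQLite 3.25 or later), reusing the sample table from this answer; the column aliases mean_score and int_mean are made up for illustration. One difference to keep in mind: SQLite's AVG always returns a floating-point value, whereas in SQL Server AVG over an INT column returns an INT, so there the score would need a cast to a decimal or float type first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mean (d TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO mean VALUES (?, ?)",
    [
        ("01/01/2013", 50), ("02/01/2013", 60), ("03/01/2013", 40),
        ("04/01/2013", 30), ("05/01/2013", 20), ("06/01/2013", 20),
    ],
)

rows = conn.execute(
    """
    SELECT d,
           score,
           AVG(score) OVER () AS mean_score,                   -- mean over all rows
           (SELECT SUM(score) / COUNT(*) FROM mean) AS int_mean -- integer division
    FROM mean
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

The scores sum to 220 over 6 rows, so the integer-division column shows 36 while the window average keeps the fractional part (about 36.67).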
