Select top 10 percent, also bottom percent in SQL Server - sql-server

I have two questions:
When using the select top 10 percent statement, for example on a test database with 100 scores, like this:
Select top 10 percent score
from test
Would SQL Server return the 10 highest scores, or just the first 10 rows in whatever order the data happens to be stored (e.g. if the lowest scores were entered into the database first, would this return the 10 lowest scores)?
I want to get both the 10 highest and the 10 lowest scores out of these 100 scores. What should I do?

You could also use the NTILE window function to group your scores into 10 groups of data - group no. 1 would be the lowest 10%, group no. 10 would be the top 10%:
;WITH Percentile AS
(
SELECT
Score,
ScoreGroup = NTILE(10) OVER(ORDER BY Score)
FROM
test
)
SELECT *
FROM Percentile
WHERE ScoreGroup IN (1, 10)
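How NTILE splits rows when the row count is not a multiple of the group count is easy to overlook. The sketch below is plain Python, not SQL Server itself, and mimics the documented rule that group sizes differ by at most one, with the larger groups coming first. The function name `ntile` and the row counts are illustrative assumptions.

```python
from collections import Counter
from typing import List

def ntile(row_count: int, buckets: int) -> List[int]:
    """Return the 1-based bucket number for each row position, in sort order."""
    base, extra = divmod(row_count, buckets)
    assignment = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets each receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        assignment.extend([bucket] * size)
    return assignment

even = Counter(ntile(100, 10))   # 100 rows split cleanly into 10 groups
uneven = Counter(ntile(53, 5))   # 53 rows cannot be split evenly into 5 groups
print(sorted(even.items()))
print(sorted(uneven.items()))
```

With 100 rows every group holds exactly 10 rows, so groups 1 and 10 are the bottom and top 10%. With an uneven count the two end groups can differ in size by one row.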

Using a UNION ALL means that it will count all rows twice.
You can do it with a single count as below. Whether or not this will be more efficient will depend (e.g. on indexes).
WITH T
AS (SELECT *,
1E0 * ROW_NUMBER()
OVER (
ORDER BY score) / COUNT(*)
OVER() AS p
FROM test)
SELECT *
FROM T
WHERE p <= 0.1 -- <= keeps the row that sits exactly at the 10% mark
OR p > 0.9

select score from
(Select top 10 percent score
from test
order by score desc
)a
union all
select score from
(select top 10 percent score
from test
order by score asc
)b
If duplicate scores should be removed, use UNION instead of UNION ALL.

Order ascending and take TOP 10 PERCENT to get the lowest scores. Then order descending and take TOP 10 PERCENT to get the highest scores. Then union these two queries.

Related

Calculate percentage in integer of a column in SQL SERVER?

I have just started experimenting with how to calculate the percentage of a row. This is the code I wrote:
SELECT DISTINCT
ServiceName,
COUNT(serviceID) AS Services
FROM Tester_DW
WHERE DateToday=20150410
GROUP BY ServiceName
How can I calculate the percentage of the Services column above, and have the percentage as an integer? Is it easier to put my query result in a #temp table and calculate the percentage from there, or is it possible to calculate the integer percentage on the fly?
ADDED: Output sketch
ServiceName | Services | % of Total
-----------------------------------
TV-cable    | 4500     | 40%
Mobile BB   | 3000     | 10%
MOBILE wifi | 20       | 5%
It is hard to get right, because the rounded integer percentages must still sum to 100% in total.
One option is the Largest Remainder Method:
;WITH x AS
(
SELECT
ServiceName,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS [Percent],
FLOOR(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER ()) AS [IntPercent],
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () % 1 AS [Remainder]
FROM Tester_DW
GROUP BY ServiceName
)
SELECT ServiceName, IntPercent + CASE WHEN Priority <= Gap THEN 1 ELSE 0 END AS IntPercent
FROM
(
SELECT *, ROW_NUMBER() OVER (ORDER BY Remainder DESC) AS Priority, 100 - SUM(IntPercent) OVER () AS Gap FROM x
) data
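The largest remainder logic in the query above can be traced outside the database. This is a minimal Python sketch, not the answerer's code: the function name `integer_percentages` is hypothetical, the counts are taken from the question's output sketch, and it assumes the total count is greater than zero. It uses integer arithmetic so the remainders are exact.

```python
from typing import Dict

def integer_percentages(counts: Dict[str, int]) -> Dict[str, int]:
    """Round shares to whole percentages that always sum to 100."""
    total = sum(counts.values())  # assumed to be greater than zero
    # Floor of each exact percentage, plus the remainder left over by flooring.
    floors = {name: count * 100 // total for name, count in counts.items()}
    remainders = {name: count * 100 % total for name, count in counts.items()}
    gap = 100 - sum(floors.values())  # percentage points still to hand out
    # Give one extra point to the entries with the largest remainders.
    ranked = sorted(counts, key=lambda name: remainders[name], reverse=True)
    for name in ranked[:gap]:
        floors[name] += 1
    return floors

sample = {"TV-cable": 4500, "Mobile BB": 3000, "MOBILE wifi": 20}
result = integer_percentages(sample)
print(result)
```

For these counts the floors are 59, 39 and 0, which leaves a gap of 2 points. The two largest remainders belong to Mobile BB and TV-cable, so the final values are 60, 40 and 0.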
Percentage is a count divided by the overall (reference) count. You have to use an inner query to get that overall count:
SELECT ServiceName, COUNT(serviceID) AS Services,
FLOOR(COUNT(serviceID) * 100.0 / (SELECT COUNT(serviceID) FROM Tester_DW)) AS [percent] -- * 100.0 avoids integer division, which would truncate to 0
FROM Tester_DW WHERE...
Depending on the output you want (that is, what your reference count is), you may have to add the WHERE clause (or parts of it) in the inner query as well.
Select
ServiceName,
Count(ServiceName) as Services,
(Count(ServiceName)* 100 / (Select Count(*) From Tester_DW)) as [% of total]
From Tester_DW
Group By ServiceName
You simply divide the count of a single service by the amount of all services.
Try with window functions:
Create table t(id int)
Insert into t values
(1),(1),(2)
Select * from (Select id,
100.0*count(*) over(partition by id)/ count(*) over() p
from t) t
Group by id, p

How to retrieve 7th row in a table using SQL query

I have a table named Student with some number of records in it, and it has a column named total_mark. I need to fetch the details of the student with the 7th-largest value in the total_mark column. How can I do this in SQL Server 2008?
First, define what you mean by "7th". 7th in age? 7th in IQ? 7th in height? Whatever.
WITH
RankedStudents AS (
SELECT *, ROW_NUMBER() OVER ( ORDER BY <Whatever> ) AS RowNumber FROM <Schema>.<Object>
)
SELECT *
FROM RankedStudents
WHERE RowNumber = 7 ;
First select the top 7, then reverse the ordering and take just the first row:
SELECT TOP 1 * FROM (
SELECT TOP 7 *
FROM Student
ORDER BY total_mark desc) x
ORDER BY total_mark
Try this:
WITH CTE AS (
SELECT total_mark, RANK() OVER (ORDER BY total_mark DESC) AS RANKED FROM SN_DB
)
SELECT DISTINCT * FROM CTE WHERE RANKED = 7
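Which row counts as "7th" depends on how ties are numbered, and the answers above use both ROW_NUMBER and RANK. The following Python sketch imitates the numbering rules of ROW_NUMBER, RANK and DENSE_RANK on a made-up list of marks. The data and the function name `rankings` are assumptions for illustration only.

```python
from typing import List, Tuple

def rankings(marks: List[int]) -> List[Tuple[int, int, int, int]]:
    """Return (mark, row_number, rank, dense_rank) per row, highest mark first."""
    ordered = sorted(marks, reverse=True)
    rows = []
    rank = dense = 0
    previous = None
    for position, mark in enumerate(ordered, start=1):
        if mark != previous:
            rank = position   # RANK jumps to the current position after ties
            dense += 1        # DENSE_RANK only ever increases by one
            previous = mark
        rows.append((mark, position, rank, dense))
    return rows

marks = [95, 90, 90, 88, 85, 85, 85, 80, 78, 70]  # hypothetical total_mark values
rows = rankings(marks)
seventh_by_row_number = [m for m, rn, r, d in rows if rn == 7]
seventh_by_rank = [m for m, rn, r, d in rows if r == 7]
seventh_by_dense_rank = [m for m, rn, r, d in rows if d == 7]
print(seventh_by_row_number, seventh_by_rank, seventh_by_dense_rank)
```

With these marks, filtering on a rank of 7 returns nothing because the tie on 85 makes the ranks skip from 5 to 8. ROW_NUMBER picks one of the tied 85s, and DENSE_RANK picks the 7th distinct mark.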

how to get certain sql results

I'm looking to get certain SQL results from a query depending on where they are positioned. For example, consider this code:
SELECT * FROM Product ORDER BY id asc
which could return at least 100 or so results.
The question is: how can I get results 1 - 10 of that, and then, in a separate query, results 11 - 20, or even the results positioned 51 - 60?
Use a CTE to get the row number and then filter on the row number column:
with your_query as(
SELECT ROW_NUMBER() OVER(ORDER BY ID ASC) AS Row, *
FROM Product
)
select * from your_query
where Row >=5 and Row<=10
There are a number of ways, here's one approach using ROW_NUMBER:
DECLARE @StartRow INTEGER = 11
DECLARE @EndRow INTEGER = 20
;WITH Data AS
(
SELECT TOP(@EndRow) ROW_NUMBER() OVER (ORDER BY id) AS RowNo, *
FROM Product
ORDER BY id -- makes TOP pick the first @EndRow rows deterministically
)
SELECT *
FROM Data
WHERE RowNo BETWEEN @StartRow AND @EndRow
ORDER BY Id
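The start and end row numbers for any page follow from the page number and the page size. A small Python sketch of that arithmetic is shown here. The helper name `page_bounds` and the default page size of 10 are assumptions chosen to match the ranges in the question.

```python
from typing import Tuple

def page_bounds(page: int, page_size: int = 10) -> Tuple[int, int]:
    """Return the inclusive (start_row, end_row) for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    start_row = (page - 1) * page_size + 1
    end_row = page * page_size
    return start_row, end_row

print(page_bounds(1))  # rows for the first page
print(page_bounds(2))  # rows for the second page
print(page_bounds(6))  # rows for the sixth page
```

The resulting pair can be passed as the start and end row values of a ROW_NUMBER-based query like the ones above.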

How do I exclude outliers from an aggregate query?

I'm creating a report comparing total time and volume across units. Here is a simplification of the query I'm using at the moment:
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM main_table m
WHERE m.unit <> ''
AND m.TimeInMinutes > 0
GROUP BY m.Unit
HAVING COUNT(*) > 15
However, I have been told that I need to exclude cases where the row's time is in the highest or lowest 5% to try and get rid of a few wacky outliers. (As in, remove the rows before the aggregates are applied.)
How do I do that?
You can exclude the top and bottom x percentiles with NTILE
SELECT m.Unit,
COUNT(*) AS Count,
SUM(m.TimeInMinutes) AS TotalTime
FROM
(SELECT
m.Unit,
m.TimeInMinutes, -- needed by SUM in the outer query
NTILE(20) OVER (ORDER BY m.TimeInMinutes) AS Buckets
FROM
main_table m
WHERE
m.unit <> '' AND m.TimeInMinutes > 0
) m
WHERE
Buckets BETWEEN 2 AND 19
GROUP BY m.Unit
HAVING COUNT(*) > 15
Edit: this article has several techniques too
One way would be to exclude the outliers with a not in clause:
where m.ID not in
(
select top 5 percent ID
from main_table
order by
TimeInMinutes desc
)
And another not in clause for the bottom five percent.
NTile is quite inexact. If you run NTile against the sample view below, you will see that it catches some indeterminate number of rows instead of 90% from the center. The suggestion to use TOP 95%, then reverse TOP 90% is almost correct except that 90% x 95% gives you only 85.5% of the original dataset. So you would have to do
select top 94.7368 percent *
from (
select top 95 percent *
from
order by .. ASC
) X
order by .. DESC
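The 94.7368 figure comes from dividing the share you want to keep by the share that survives the first pass. A short Python check of that arithmetic, using exact fractions, is given below. The variable names are illustrative.

```python
from fractions import Fraction

first_pass = Fraction(95, 100)  # inner query keeps the lowest 95% of all rows
target = Fraction(90, 100)      # share of the original rows we want to end up with

# Taking TOP 90 PERCENT of the intermediate set keeps too little overall.
naive_result = first_pass * Fraction(90, 100)

# Share of the intermediate set that must be kept to reach the target.
second_pass = target / first_pass

print(float(naive_result * 100))            # overall percentage with the naive approach
print(round(float(second_pass * 100), 4))   # percentage to use in the outer TOP
```

The naive combination keeps 85.5% of the rows, while keeping 90/95 (about 94.7368%) of the intermediate set leaves 90% of the original rows.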
First create a view to match your table column names
create view main_table
as
select type unit, number as timeinminutes from master..spt_values
Try this instead
select Unit, COUNT(*), SUM(TimeInMinutes)
FROM
(
select *,
ROW_NUMBER() over (order by TimeInMinutes) rn,
COUNT(*) over () countRows
from main_table
) N -- Numbered
where rn between countRows * 0.05 and countRows * 0.95
group by Unit, N.countRows * 0.05, N.countRows * 0.95
having count(*) > 20
The HAVING clause is applied to the remaining set after removing outliers.
Because ROW_NUMBER numbers tied values individually, a cutoff that falls among the 1's in a dataset such as 1,1,1,1,1,1,2,5,6,19 removes just one instance of the 1's rather than all of them.
I think the most robust way is to sort the list into order and then exclude the top and bottom extremes. For a hundred values, you would sort ascending and take the first 95 PERCENT, then sort descending and take the first 90 PERCENT.

SQL Select Statement For Calculating A Running Average Column

I am trying to have a running average column in the SELECT statement based on a column from the n previous rows in the same SELECT statement. The average I need is based on the n previous rows in the resultset.
Let me explain
Id  Number  Average
1   1       NULL
2   3       NULL
3   2       NULL
4   4       2     <----- Average of (1, 3, 2), the Numbers from the previous 3 rows
5   6       3     <----- Average of (3, 2, 4), the Numbers from the previous 3 rows
.   .       .
.   .       .
The first 3 rows of the Average column are null because they have fewer than 3 previous rows. Row 4 in the Average column shows the average of the Number column from the previous 3 rows.
I need some help trying to construct a SQL Select statement that will do this.
This should do it:
--Test Data
CREATE TABLE RowsToAverage
(
ID int NOT NULL,
Number int NOT NULL
)
INSERT RowsToAverage(ID, Number)
SELECT 1, 1
UNION ALL
SELECT 2, 3
UNION ALL
SELECT 3, 2
UNION ALL
SELECT 4, 4
UNION ALL
SELECT 5, 6
UNION ALL
SELECT 6, 8
UNION ALL
SELECT 7, 10
--The query
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM RowsToAverage rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
Assuming that the Id column is sequential, here's a simplified query for a table named "MyTable":
SELECT
b.Id,
b.Number,
(
SELECT
AVG(a.Number)
FROM
MyTable a
WHERE
a.id >= (b.Id - 3)
AND a.id < b.Id
AND b.Id > 3
) as Average
FROM
MyTable b;
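The expected output of these queries can be cross-checked outside the database. Here is a minimal Python sketch of "average of the previous three rows", run against the test data from the first answer. The function name `trailing_average` is hypothetical, and it uses ordinary division, whereas AVG over an int column in SQL Server returns an int.

```python
from typing import List, Optional

def trailing_average(numbers: List[int], window: int = 3) -> List[Optional[float]]:
    """Average the `window` rows before each row; None where too few rows exist."""
    averages: List[Optional[float]] = []
    for index in range(len(numbers)):
        if index < window:
            averages.append(None)  # not enough previous rows yet
        else:
            previous = numbers[index - window:index]  # excludes the current row
            averages.append(sum(previous) / window)
    return averages

numbers = [1, 3, 2, 4, 6, 8, 10]  # Number column for IDs 1 through 7
averages = trailing_average(numbers)
print(averages)
```

Rows 4 and 5 come out as 2 and 3, matching the table in the question.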
Edit: I missed the point that it should average the three previous records...
For a general running average, I think something like this would work:
SELECT
id, number,
SUM(number) OVER (ORDER BY ID) /
ROW_NUMBER() OVER (ORDER BY ID) AS [RunningAverage]
FROM myTable
ORDER BY ID
A simple self join would seem to perform much better than a row referencing subquery
Generate 10k rows of test data:
drop table test10k
create table test10k (Id int, Number int, constraint test10k_cpk primary key clustered (id))
;WITH digits AS (
SELECT 0 as Number
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
)
,numbers as (
SELECT
(thousands.Number * 1000)
+ (hundreds.Number * 100)
+ (tens.Number * 10)
+ ones.Number AS Number
FROM digits AS ones
CROSS JOIN digits AS tens
CROSS JOIN digits AS hundreds
CROSS JOIN digits AS thousands
)
insert test10k (Id, Number)
select Number, Number
from numbers
I would pull the special case of the first 3 rows out of the main query, you can UNION ALL those back in if you really want it in the row set. Self join query:
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM test10k rta
)
SELECT nr.ID, nr.Number,
avg(trailing.Number) as MovingAverage
FROM NumberedRows nr
join NumberedRows as trailing on trailing.RowNumber between nr.RowNumber-3 and nr.RowNumber-1
where nr.RowNumber > 3 -- skip the first 3 rows, which lack a full trailing window
group by nr.id, nr.Number
On my machine this takes about 10 seconds, the subquery approach that Aaron Alton demonstrated takes about 45 seconds (after I changed it to reflect my test source table) :
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM test10k rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
If you do a SET STATISTICS PROFILE ON, you can see the self join has 10k executes on the table spool. The subquery has 10k executes on the filter, aggregate, and other steps.
Check out some solutions here. I'm sure that you could adapt one of them easily enough.
If you want this to be truly performant, and aren't afraid to dig into a seldom-used area of SQL Server, you should look into writing a custom aggregate function. SQL Server 2005 and 2008 brought CLR integration to the table, including the ability to write user-defined aggregate functions. A custom running total aggregate would be the most efficient way to calculate a running average like this, by far.
Alternatively you can denormalize and store precalculated running values. Described here:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx
Selects are then as fast as they can be. Of course, modifications are slower.
