SQL Server: Min/Max/Percentiles

I'm trying to get the count, min, max, and some percentiles (10th, 25th, 50th, 75th, 90th) of base salaries for each master job title.
I'm getting the following error:
Msg 8120, Level 16, State 1, Line 1
Column 'dbo.ps_employee.Base' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

Every column in your select list that is not wrapped in an aggregate function must also appear in the GROUP BY clause at the end of the query.
Documentation: https://learn.microsoft.com/en-us/sql/t-sql/queries/select-group-by-transact-sql

I was able to use the following query to get the percentiles. I used a separate one for count/min/max/average.
SELECT DISTINCT mj.title,
    PERCENTILE_DISC(.1) WITHIN GROUP (ORDER BY e.base) OVER (PARTITION BY mj.title) AS '10th',
    PERCENTILE_DISC(.25) WITHIN GROUP (ORDER BY e.base) OVER (PARTITION BY mj.title) AS '25th',
    PERCENTILE_DISC(.5) WITHIN GROUP (ORDER BY e.base) OVER (PARTITION BY mj.title) AS '50th',
    PERCENTILE_DISC(.75) WITHIN GROUP (ORDER BY e.base) OVER (PARTITION BY mj.title) AS '75th',
    PERCENTILE_DISC(.9) WITHIN GROUP (ORDER BY e.base) OVER (PARTITION BY mj.title) AS '90th'
FROM dbo.ps_employee e
FULL OUTER JOIN dbo.ps_jobs j ON e.title = j.job
FULL OUTER JOIN dbo.ps_masterjobs mj ON j.masterID = mj.ID;

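Add "OVER (PARTITION BY mj.title)" to your COUNT, MIN, MAX and AVG calls so that they become window functions like the percentiles; windowed aggregates need no GROUP BY clause. This gives you all the values in a single result set, but with one row per employee, so the output will contain duplicate rows. SELECT DISTINCT, as in the percentile query above, collapses them.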
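The windowed-aggregate idea can be tried out in a minimal sketch like the one below. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made so the example runs anywhere; window functions need SQLite 3.25 or later). The `salaries` table and its rows are made up for illustration, and PERCENTILE_DISC is left out because SQLite does not provide it.

```python
import sqlite3

# Hypothetical sample data: one row per employee.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE salaries (title TEXT, base REAL)")
conn.executemany(
    "INSERT INTO salaries VALUES (?, ?)",
    [
        ("Analyst", 50000),
        ("Analyst", 60000),
        ("Analyst", 70000),
        ("Manager", 90000),
        ("Manager", 110000),
    ],
)

# Each aggregate is computed over the whole title partition, so every
# employee row carries the same summary values. DISTINCT then collapses
# those repeated rows into one row per title.
query = """
SELECT DISTINCT
    title,
    COUNT(*)  OVER (PARTITION BY title) AS n,
    MIN(base) OVER (PARTITION BY title) AS min_base,
    MAX(base) OVER (PARTITION BY title) AS max_base,
    AVG(base) OVER (PARTITION BY title) AS avg_base
FROM salaries
ORDER BY title
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```

In SQL Server the same SELECT list could sit next to the PERCENTILE_DISC columns, since all of them use the same partition.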

Related

How to get Column from Max of multi another columns?

I need to get the G value from the row that contains the maximum across the columns H, I and J.
For example: after grouping, the maximum value of H, I or J is 170, so I need the value of column G from that row, which is 06/25/2022 07:00:00.
I used the following query. It seems to work, but it returns a lot of missing values after the GROUP BY:
"Select C,MAX(MAX(H),MAX(I),MAX(J)) as d1,G GROUP BY C HAVING H=d1 OR I=d1 OR j=d1"
How do I fix this?

Use a CTE that returns the max of H, I and J for each C like this:
WITH cte AS (
SELECT C, MAX(MAX(H), MAX(I), MAX(J)) max
FROM tablename
GROUP BY C
)
SELECT t.C, t.G
FROM tablename t
WHERE (t.c, MAX(t.H, t.I, t.J)) IN (SELECT C, max FROM cte);
For your sample data, maybe it is more suitable to GROUP BY B.
Or, with ROW_NUMBER() window function:
WITH cte AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY C ORDER BY MAX(H, I, J) DESC) rn
FROM tablename
)
SELECT C, G
FROM cte
WHERE rn = 1;
Or, with FIRST_VALUE() window function:
SELECT DISTINCT C,
FIRST_VALUE(G) OVER (PARTITION BY C ORDER BY MAX(H, I, J) DESC) G
FROM tablename;
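The ROW_NUMBER() variant can be checked with a small sketch. Note that the multi-argument MAX(H, I, J) used in these answers is SQLite's scalar max, so the example runs through Python's sqlite3 module (SQLite 3.25 or later assumed for window functions). The sample rows are invented; only the 170 and 06/25/2022 07:00:00 values come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tablename (C TEXT, G TEXT, H INTEGER, I INTEGER, J INTEGER)"
)
# Hypothetical rows: two groups in C, two candidate rows per group.
conn.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?, ?, ?)",
    [
        ("x", "06/25/2022 06:00:00", 100, 120, 110),
        ("x", "06/25/2022 07:00:00", 170, 90, 130),
        ("y", "06/26/2022 08:00:00", 10, 20, 30),
        ("y", "06/26/2022 09:00:00", 5, 45, 15),
    ],
)

# Number the rows of each C group by their largest H/I/J value,
# highest first, then keep only the top row per group.
query = """
WITH cte AS (
    SELECT C, G,
           ROW_NUMBER() OVER (PARTITION BY C ORDER BY MAX(H, I, J) DESC) AS rn
    FROM tablename
)
SELECT C, G FROM cte WHERE rn = 1 ORDER BY C
"""
rows = conn.execute(query).fetchall()
print(rows)
```

If two rows in a group tie on the maximum, ROW_NUMBER() picks one of them arbitrarily; adding a second ORDER BY key would make the choice deterministic.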

Select n random rows from table per group of codes

I have a table full of customer details from insurance policies or quotes. Each one is assigned an output code that relates to a marketing campaign and each occurs 4 times, one per "batch" which just represents a week in the month. I need to select a random 25 percent of the rows per code, per batch number (1-4) to put into another table so I can then hold those rows back and prevent the customer being marketed to.
All the solutions I've seen on stack so far instruct how to do this for a specific number of rows per group using a ROW_NUMBER in an initial CTE query then selecting from that where rn <= a given number. I need to do this but select 25 percent of each group instead.
I've tried the solution from "Select N random rows in group", but a fixed row number doesn't move me any further forward.
Using the linked solution, this is how my code currently is without a complete where clause because I know this isn't quite what I need.
;WITH AttributionOutput AS (
SELECT [Output Code], BatchNo, MonthandYear
FROM [dbo].[Direct_Marketing_UK]
WHERE MonthandYear = 'Sep2019'
And [Output Code] NOT IN ('HOMELIVE','HOMELIVENB','HOMENBLE')
GROUP BY [Output Code], BatchNo, MonthandYear
HAVING COUNT(*) >= 60
)
, CodeandBatch AS (
SELECT dmuk.PK_ID,
dmuk.MonthandYear,
dmuk.PackNo,
dmuk.BatchNo,
dmuk.CustomerKey,
dmuk.URN,
dmuk.[Output Code],
dmuk.[Quote/Renewal Date],
dmuk.[Name],
dmuk.[Title],
dmuk.[Initial],
dmuk.[Forename],
dmuk.[Surname],
dmuk.[Salutation],
dmuk.[Address 1],
dmuk.[Address 2],
dmuk.[Address 3],
dmuk.[Address 4],
dmuk.[Address 5],
dmuk.[Address 6],
dmuk.[PostCode],
ROW_NUMBER() OVER(PARTITION BY dmuk.[Output Code], dmuk.BatchNo ORDER BY newid()) as rn
FROM [dbo].[Direct_Marketing_UK] dmuk INNER JOIN
AttributionOutput ao ON dmuk.[Output Code] = ao.[Output Code]
AND dmuk.BatchNo = ao.BatchNo
AND dmuk.MonthandYear = ao.MonthandYear
)
SELECT URN,
[Output Code],
[BatchNo]
FROM CodeandBatch
WHERE rn <=
I can't see how a ROW_NUMBER() can help me to grab 25 percent of the rows from every combination of Output Code and batch number.

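I suggest you look at NTILE for this. NTILE(4) over the same PARTITION BY, ordered by NEWID(), splits each group into four random buckets of near-equal size, so keeping a single bucket yields roughly 25 percent of each group.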
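A minimal sketch of the NTILE approach follows. It runs against SQLite through Python's sqlite3 module rather than SQL Server, so `random()` stands in for `NEWID()`; the `mailing` table, its column names and the generated rows are all hypothetical.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mailing (urn INTEGER, output_code TEXT, batch_no INTEGER)"
)

# Hypothetical data: 2 output codes x 4 batches x 8 customers each.
rows = []
urn = 0
for code in ("CAR01", "VAN02"):
    for batch in (1, 2, 3, 4):
        for _ in range(8):
            urn += 1
            rows.append((urn, code, batch))
conn.executemany("INSERT INTO mailing VALUES (?, ?, ?)", rows)

# NTILE(4) deals the randomly ordered rows of each (code, batch) group
# into four buckets; bucket 1 is the held-back quarter.
query = """
WITH tiled AS (
    SELECT urn, output_code, batch_no,
           NTILE(4) OVER (
               PARTITION BY output_code, batch_no
               ORDER BY random()
           ) AS tile
    FROM mailing
)
SELECT urn, output_code, batch_no FROM tiled WHERE tile = 1
"""
held_back = conn.execute(query).fetchall()
per_group = Counter((code, batch) for _, code, batch in held_back)
print(per_group)
```

When a group's size is not a multiple of four, NTILE gives the extra rows to the lowest-numbered buckets, so bucket 1 holds slightly more than 25 percent rather than less.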

SQL syntax for complex GROUP BY with OVER statement: calculating Gini coefficient for multiple sets

I want to calculate the Gini coefficient for a number of sets, contained in a two-column table (here called #cits) that holds a value and a set-ID. I have been experimenting with different Gini-coefficient calculations, described here (StackExchange query) and here (StackOverflow question with some good replies). Both of the examples only calculate one coefficient for one table, whereas I would like to do it with a GROUP BY clause.
The #cits table contains two columns, c and cid, being the value and set-ID respectively.
Here is my current try (incomplete):
select count(c) as numC,
sum(c) as totalC,
(select row_number() over(order by c asc, cid) id, c from #cits) as a
from #cits group by cid
Selecting numC and totalC works well, of course, but the next line is giving me a headache. I can see that the syntax is wrong, but I can't figure out how to assign the row_number() per c within each cid.
EDIT:
Based on the suggestions, I used partition, like so:
select cid,sumC = sum(a.id * a.c)
into #srep
from (
select cid,row_number() over (partition by cid order by c asc) id,
c
from #cits
) as a
group by a.cid
select count(c) as numC,
sum(c) as totalC, b.sumC
into #gtmp
from #cits a
join #srep b
on a.cid = b.cid
group by a.cid,b.sumC
select
gini = 2 * sumC / (totalC * numC) - (numC - 1) / numC
from #gtmp
This almost works. I get a result, but it is >1, which is unexpected, as the Gini-coefficient should be between 0 and 1. As stated in the comments, I would have preferred a one-query solution as well, but it is not a major issue at all.

You can "partition" the data so that row numbering starts over for each cid, but I'm not sure this is what you're after.
I'm assuming you want cid displayed, since you are grouping by it.
select count(c) as numC
, sum(c) as totalC
, row_number() over(partition by cID order by c asc) as a
, cid
from #cits group by cid
Note you don't need the subquery.
Yeah this isn't likely right.
Output:
NumC  TotalC  A  CID
24    383     1  1
15    232     1  2

If I'm understanding correctly, you need numC and totalC for each C in a cid set, as well as the position of the c inside of that set. This should get you what you need:
select
rn.cid,
rn.c,
row_number() over (partition by rn.cid order by rn.c) as id,
agg.numC,
agg.totalC
from #cits rn
left outer join
(
select
cid,
count(c) as numC,
sum(c) as totalC
from #cits
group by cid
) agg
on rn.cid = agg.cid
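On the result above 1 reported in the question's edit, one possible explanation, offered as an assumption rather than a confirmed diagnosis: the usual rank-based formula for ascending-sorted values is 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n, whereas the edit subtracts (numC - 1) / numC, and with integer columns SQL Server would also evaluate that fraction with integer division. The sketch below computes the per-set coefficient in a single query using the (n + 1) / n form and floating-point arithmetic. It runs on SQLite via Python's sqlite3 module as a stand-in, with a made-up `cits` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cits (cid INTEGER, c INTEGER)")
# Hypothetical data: set 1 is perfectly equal, set 2 is maximally unequal.
conn.executemany(
    "INSERT INTO cits VALUES (?, ?)",
    [(1, 5), (1, 5), (1, 5), (1, 5), (2, 0), (2, 0), (2, 0), (2, 10)],
)

# Rank the values inside each set, then aggregate per set.
# The 2.0 and 1.0 literals force floating-point division.
query = """
WITH ranked AS (
    SELECT cid, c,
           ROW_NUMBER() OVER (PARTITION BY cid ORDER BY c) AS id
    FROM cits
)
SELECT cid,
       2.0 * SUM(id * c) / (COUNT(*) * SUM(c))
       - (COUNT(*) + 1.0) / COUNT(*) AS gini
FROM ranked
GROUP BY cid
ORDER BY cid
"""
ginis = conn.execute(query).fetchall()
print(ginis)
```

With this form the equal set comes out at 0 and the set where one member holds everything comes out at (n - 1) / n, which stays below 1.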

GROUP BY doesn't contain specific column

I have the following statement in MSSQL
SELECT a, b, MAX(t)
FROM table
GROUP BY a, b
What I also want is to show the c and d columns from the specific row behind each result. How can I do that?

It sounds like you're looking for ROW_NUMBER() or RANK() (the former will ignore ties, the latter will include them), something like:
;With Ranked as (
SELECT a,b,c,d,t,
ROW_NUMBER() OVER (PARTITION BY a,b
ORDER BY t desc) as rn
FROM table
)
SELECT * from Ranked where rn = 1
Which will return one row for each unique combination of the a,b columns, choosing the other values such that they come from the row with the highest t value (and, as I say, this variant ignores ties).
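The difference between the two functions on ties can be seen in a short sketch. It uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, and a hypothetical table named `readings` with invented rows, where two rows in the same (a, b) group share the highest t.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE readings (a INTEGER, b INTEGER, c TEXT, d TEXT, t INTEGER)"
)
# Hypothetical rows: group (1, 1) has a tie for the highest t.
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, "c1", "d1", 10),
        (1, 1, "c2", "d2", 20),
        (1, 1, "c3", "d3", 20),
        (2, 1, "c4", "d4", 5),
    ],
)


def top_rows(rank_fn):
    """Return the rows ranked first per (a, b) using the given window function."""
    query = f"""
    WITH ranked AS (
        SELECT a, b, c, d, t,
               {rank_fn}() OVER (PARTITION BY a, b ORDER BY t DESC) AS rn
        FROM readings
    )
    SELECT a, b, c, d, t FROM ranked WHERE rn = 1 ORDER BY a, b, c
    """
    return conn.execute(query).fetchall()


no_ties = top_rows("ROW_NUMBER")  # exactly one row per group
with_ties = top_rows("RANK")      # every row that shares the top t
print(no_ties)
print(with_ties)
```

Which of the tied rows ROW_NUMBER() keeps is not defined unless the ORDER BY includes a tie-breaking column.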

Do I need to use the dreaded sql server loop/ cursor for the result set I need?

I need a sql server result set that "breaks" on a column value, but if I order by this column in a ranking function, the order I really need is lost. This is best explained by example. The query I'm currently experimenting with is:
select RANK() over(partition by Symbol, Period order by TradeDate desc)
SymbSmaOverUnderGroup, Symbol, TradeDate, Period, Value, Low, LowMinusVal,
LMVSign
from #smasAndLow3
and it returns:
Rnk  Symbol  TradeDate  Period  Value  Low    LowMinusVal  LMVSign
1    A       9/6/12     5       37.09  36.71  -.38         U
2    A       9/5/12     5       37.03  36.62  -.41         U
3    A       9/4/12     5       37.07  36.71  -.36         U
4    A       8/31/12    5       37.15  37.30  .15          O
5    A       8/30/12    5       37.22  37.40  .18          O
6    A       8/29/12    5       37.00  36.00  -1.00        U
7    A       8/28/12    5       37.10  37.00  -.10         U
The rank I need here is: 1,1,1,2,2,3,3. So I need to partition by Symbol, Period, and I need to start a new partition on LMVSign (which only contains the values U, O, and E), but it's essential that I order by TradeDate desc. Unless I'm mistaken, partitioning or ordering by LMVSign will make it impossible to sort on the date column. I hope this makes sense. I'm working like mad to do this without a cursor, but I can't get it to work.. thanks in advance.

UPDATE after clarification: I think that you are entering the world of islands and gaps. If your requirement is to group rows by Symbol, Period and LMVSign in descending TradeDate order, starting a new rank whenever any one of these columns changes, you might use this (based on Itzik Ben-Gan's solution to islands and gaps).
; with islandsAndGaps as
(
    select *,
        -- Build the group key. The two row numbers advance in step inside
        -- a run of equal LMVSign values, so their difference is constant
        -- within a run. The difference itself carries no ordering.
        row_number() over (partition by Symbol, Period
                           order by TradeDate)
      - row_number() over (partition by Symbol, Period
                           order by LMVSign, TradeDate) grp
    from Table1
),
grouped as
(
    select *,
        -- To order the groups, tag each one with its latest date.
        -- grp is only unique per LMVSign, so partition by both.
        max(TradeDate) over (partition by Symbol, Period, LMVSign, grp) DateGroup
    from islandsAndGaps
)
-- Now rank the groups, newest first, restarting for each Symbol and Period.
select dense_rank() over (partition by Symbol, Period
                          order by DateGroup desc) Rnk,
    *
from grouped
order by TradeDate desc
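The islands-and-gaps query can be checked against the seven sample rows from the question with a sketch like this one. It runs on SQLite through Python's sqlite3 module as a stand-in for SQL Server; the `trades` table name is hypothetical, the dates are rewritten in ISO format so that they sort correctly as text, and only the columns the technique needs are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades (Symbol TEXT, TradeDate TEXT, Period INTEGER, LMVSign TEXT)"
)
# The question's sample rows, with ISO-formatted dates.
conn.executemany(
    "INSERT INTO trades VALUES (?, ?, ?, ?)",
    [
        ("A", "2012-09-06", 5, "U"),
        ("A", "2012-09-05", 5, "U"),
        ("A", "2012-09-04", 5, "U"),
        ("A", "2012-08-31", 5, "O"),
        ("A", "2012-08-30", 5, "O"),
        ("A", "2012-08-29", 5, "U"),
        ("A", "2012-08-28", 5, "U"),
    ],
)

query = """
WITH islands AS (
    -- The difference of the two row numbers is constant within each run
    -- of consecutive rows that share an LMVSign.
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Symbol, Period ORDER BY TradeDate)
         - ROW_NUMBER() OVER (PARTITION BY Symbol, Period
                              ORDER BY LMVSign, TradeDate) AS grp
    FROM trades
),
grouped AS (
    -- Tag every run with its latest date so the runs can be ordered.
    SELECT *,
           MAX(TradeDate) OVER (
               PARTITION BY Symbol, Period, LMVSign, grp
           ) AS DateGroup
    FROM islands
)
SELECT DENSE_RANK() OVER (PARTITION BY Symbol, Period
                          ORDER BY DateGroup DESC) AS Rnk,
       TradeDate, LMVSign
FROM grouped
ORDER BY TradeDate DESC
"""
result = conn.execute(query).fetchall()
ranks = [row[0] for row in result]
print(ranks)
```

The ranks come back in the 1, 1, 1, 2, 2, 3, 3 pattern the question asks for, while the rows stay in descending TradeDate order.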
OLD answer:
Partition by restarts ranking. I think that you need order by:
dense_rank() over (order by Symbol, Period, LMVSign desc) Rnk
and then you should use TradeDate in order by:
order by Rnk, TradeDate desc
If you need it as a number, add another column:
row_number() over (order by Symbol, Period, LMVSign desc, TradeDate desc) rn