SQL Server NTILE - Same value in different quartile - sql-server

I have a scenario where i'm splitting a number of results into quartilies using the SQL Server NTILE function below. The goal is to have an as equal number of rows in each class
case NTILE(4) over (order by t2.TotalStd)
when 1 then 'A' when 2 then 'B' when 3 then 'C' else 'D' end as Class
The result table is shown below and there is a (9,9,8,8) split between the 4 class groups A,B,C and D.
There are two results which cause me an issue, both rows have a same total std value of 30 but are assigned to different quartiles.
8 30 A
2 30 B
I'm wondering is there a way to ensure that rows with the same value are assigned to the same quartile? Can i group or partition by another column to get this behaviour?
Pos TotalStd class
1 16 A
2 23 A
3 21 A
4 29 A
5 25 A
6 26 A
7 28 A
8 30 A
9 29 A
1 31 B
2 30 B
3 32 B
4 32 B
5 34 B
6 32 B
7 34 B
8 32 B
9 33 B
1 36 C
2 35 C
3 35 C
4 35 C
5 40 C
6 38 C
7 41 C
8 43 C
1 43 D
2 48 D
3 45 D
4 47 D
5 44 D
6 48 D
7 46 D
8 57 D

You will need to re create the Ntile function, using the rank function.
The rank function gives the same rank for rows with the same value. The value later 'jumps' to the next rank as if you used row_number.
We can use this behavior to mimic the Ntile function, forcing it to give the same Ntile value to rows with the same value. However - this will cause the Ntile partitions to be with a different size.
See the example below for the new Ntile using 4 bins:
declare #data table ( x int )
insert #data values
(1),(2),
(2),(3),
(3),(4),
(4),(5)
select
x,
1+(rank() over (order by x)-1) * 4 / count(1) over (partition by (select 1)) as new_ntile
from #data
Results:
x new_ntile
---------------
1 1
2 1
2 1
3 2
3 2
4 3
4 3
5 4

Not sure what you're expecting to happen here, really. SQL Server has divided up the data into 4 groups of as-equal-size-as-possible, as you asked. What do you want to happen? Have a look at this example:
declare #data table ( x int )
insert #data values
(1),(2),
(2),(3),
(3),(4),
(4),(5)
select
x,
NTILE(4) over (order by x) as ntile
from #data
Results:
x ntile
----------- ----------
1 1
2 1
2 2
3 2
3 3
4 3
4 4
5 4
Now every ntile group shares a value with the one(s) next to it! But what else should it do?

Try this:
; with a as (
       select TotalStd,Class=case ntile(4)over( order by TotalStd )
                                when 1 then 'A'
                                when 2 then 'B'
                                when 3 then 'C'
                                when 4 then 'D'
                                end
                from t2
                group by TotalStd
)
select d.*, a.Class from t2 d
inner join a on a.TotalStd=d.TotalStd
order by Class,Pos;

Here we have a table of 34 rows.
DECLARE #x TABLE (TotalStd INT)
INSERT #x (TotalStd) VALUES (16), (21), (23), (25), (26), (28), (29), (29), (30), (30), (31), (32), (32), (32), (32), (33), (34),
(34), (35), (35), (35), (36), (38), (40), (41), (43), (43), (44), (45), (46), (47), (48), (48), (57)
SELECT '#x', TotalStd FROM #x ORDER BY TotalStd
We want to divide into quartiles. If we use NTILE, the bucket sizes will be roughly the same size (8 to 9 rows each) but ties are broken arbitrarily:
SELECT '#x with NTILE', TotalStd, NTILE(4) OVER (ORDER BY TotalStd) quantile FROM #x
See how 30 appears twice: once in quantile 1 and once in quantile 2. Similarly, 43 appears both in quantiles 3 and 4.
What I ought to find is 10 items in quantile 1, 8 in quantile 2, 7 in quantile 3 and 9 in quantile 4 (i.e. not a perfect 9-8-9-8 split, but such a split is impossible if we are not allowed to break ties arbitrarily). I can do it using NTILE to determine cutoff points in a temporary table:
DECLARE #cutoffs TABLE (quantile INT, min_value INT, max_value INT)
INSERT #cutoffs (quantile, min_value)
SELECT y.quantile, MIN(y.TotalStd)
FROM (SELECT TotalStd, NTILE(4) OVER (ORDER BY TotalStd) AS quantile FROM #x) y
GROUP BY y.quantile
-- The max values are the minimum values of the next quintiles
UPDATE c1 SET c1.max_value = ISNULL(C2.min_value, (SELECT MAX(TotalStd) + 1 FROM #x))
FROM #cutoffs c1 LEFT OUTER JOIN #cutoffs c2 ON c2.quantile - 1 = c1.quantile
SELECT '#cutoffs', * FROM #cutoffs
We'll use the the boundary values in the #cutoffs table to create the final table:
SELECT x.TotalStd, c.quantile FROM #x x
INNER JOIN #cutoffs c ON x.TotalStd >= c.min_value AND x.TotalStd < c.max_value

Related

SQL Server script not working as expected

I have this little script that shall return the first number in a column of type int which is not used yet.
SELECT t1.plu + 1 AS plu
FROM tovary t1
WHERE NOT EXISTS (SELECT 1 FROM tovary t2 WHERE t2.plu = t1.plu + 1)
AND t1.plu > 0;
this returns the unused numbers like
3
11
22
27
...
The problem is, that when I make a simple select like
SELECT plu
FROM tovary
WHERE plu > 0
ORDER BY plu ASC;
the results are
1
2
10
20
...
Why the first script isn't returning some of free numbers like 4, 5, 6 and so on?
Compiling a formal answer from the comments.
Credit to Larnu:
It seems what the OP really needs here is an (inline) Numbers/Tally (table) which they can then use a NOT EXISTS against their table.
Sample data
create table tovary
(
plu int
);
insert into tovary (plu) values
(1),
(2),
(10),
(20);
Solution
Isolating the tally table in a common table expression First1000 to produce the numbers 1 to 1000. The amount of generated numbers can be scaled up as needed.
with First1000(n) as
(
select row_number() over(order by (select null))
from ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) a(n) -- 10^1
cross join ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) b(n) -- 10^2
cross join ( values (0),(0),(0),(0),(0),(0),(0),(0),(0),(0) ) c(n) -- 10^3
)
select top 20 f.n as Missing
from First1000 f
where not exists ( select 'x'
from tovary
where plu = f.n);
Using top 20 in the query above to limit the output. This gives:
Missing
-------
3
4
5
6
7
8
9
11
12
13
14
15
16
17
18
19
21
22
23
24

Grouping between two datetimes

I have a bunch of production orders and I'm trying to group by within a datetime range, then count the quantity within that range. For example, I want to group from 2230 to 2230 each day.
PT.ActualFinish is datetime (eg. if PT.ActualFinish is 2020-05-25 23:52:30 then it would be counted on the 26th May instead of the 25th)
Currently it's grouped by date (midnight to midnight) as opposed to the desired 2230 to 2230.
GROUP BY CAST(PT.ActualFinish AS DATE)
I've been trying to reconcile some DATEADD with the GROUP without success. Is it possible?
Just add 1.5 hours (90 minutes) and then extract the date:
group by convert(date, dateadd(minute, 90, pt.acctualfinish))
For this kind of thing you can use a function I created called NGroupRangeAB (code below) which can be used to create groups over values with an upper and lower bound.
Note that this:
SELECT f.*
FROM core.NGroupRangeAB(0,1440,12) AS f
ORDER BY f.RN;
Returns:
RN GroupNumber Low High
--- ------------ ------ -------
0 1 0 120
1 2 121 240
2 3 241 360
3 4 361 480
4 5 481 600
5 6 601 720
6 7 721 840
7 8 841 960
8 9 961 1080
9 10 1081 1200
10 11 1201 1320
11 12 1321 1440
This:
SELECT
f.GroupNumber,
L = DATEADD(MINUTE,f.[Low]-SIGN(f.[Low]),CAST('00:00:00.0000000' AS TIME)),
H = DATEADD(MINUTE,f.[High]-1,CAST('00:00:00.0000000' AS TIME))
FROM core.NGroupRangeAB(0,1440,12) AS f
ORDER BY f.RN;
Returns:
GroupNumber L H
------------- ---------------- ----------------
1 00:00:00.0000000 01:59:00.0000000
2 02:00:00.0000000 03:59:00.0000000
3 04:00:00.0000000 05:59:00.0000000
4 06:00:00.0000000 07:59:00.0000000
5 08:00:00.0000000 09:59:00.0000000
6 10:00:00.0000000 11:59:00.0000000
7 12:00:00.0000000 13:59:00.0000000
8 14:00:00.0000000 15:59:00.0000000
9 16:00:00.0000000 17:59:00.0000000
10 18:00:00.0000000 19:59:00.0000000
11 20:00:00.0000000 21:59:00.0000000
12 22:00:00.0000000 23:59:00.0000000
Now for a real-life example that may help you:
-- Sample Date
DECLARE #table TABLE (tm TIME);
INSERT #table VALUES ('00:15'),('11:20'),('21:44'),('09:50'),('02:15'),('02:25'),
('02:31'),('23:31'),('23:54');
-- Solution:
SELECT
GroupNbr = f.GroupNumber,
TimeLow = f2.L,
TimeHigh = f2.H,
Total = COUNT(t.tm)
FROM core.NGroupRangeAB(0,1440,12) AS f
CROSS APPLY (VALUES(
DATEADD(MINUTE,f.[Low]-SIGN(f.[Low]),CAST('00:00:00.0000000' AS TIME)),
DATEADD(MINUTE,f.[High]-1,CAST('00:00:00.0000000' AS TIME)))) AS f2(L,H)
LEFT JOIN #table AS t
ON t.tm BETWEEN f2.L AND f2.H
GROUP BY f.GroupNumber, f2.L, f2.H;
Returns:
GroupNbr TimeLow TimeHigh Total
-------------------- ---------------- ---------------- -----------
1 00:00:00.0000000 01:59:00.0000000 1
2 02:00:00.0000000 03:59:00.0000000 3
3 04:00:00.0000000 05:59:00.0000000 0
4 06:00:00.0000000 07:59:00.0000000 0
5 08:00:00.0000000 09:59:00.0000000 1
6 10:00:00.0000000 11:59:00.0000000 1
7 12:00:00.0000000 13:59:00.0000000 0
8 14:00:00.0000000 15:59:00.0000000 0
9 16:00:00.0000000 17:59:00.0000000 0
10 18:00:00.0000000 19:59:00.0000000 0
11 20:00:00.0000000 21:59:00.0000000 1
12 22:00:00.0000000 23:59:00.0000000 2
Note that an inner join will eliminate the 0-count rows.
CREATE FUNCTION core.NGroupRangeAB
(
#min BIGINT, -- Group Number Lower boundary
#max BIGINT, -- Group Number Upper boundary
#groups BIGINT -- Number of groups required
)
/*****************************************************************************************
[Purpose]:
Creates an auxilliary table that allows for grouping based on a given set of rows (#rows)
and requested number of "row groups" (#groups). core.NGroupRangeAB can be thought of as a
set-based, T-SQL version of Oracle's WIDTH_BUCKET, which:
"...lets you construct equiwidth histograms, in which the histogram range is divided into
intervals that have identical size. (Compare with NTILE, which creates equiheight
histograms.)" https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions214.htm
See usage examples for more details.
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2008+
[Syntax]:
--===== Autonomous
SELECT ng.*
FROM dbo.NGroupRangeAB(#rows,#groups) AS ng;
[Parameters]:
#rows = BIGINT; the number of rows to be "tiled" (have group number assigned to it)
#groups = BIGINT; requested number of tile groups (same as the parameter passed to NTILE)
[Returns]:
Inline Table Valued Function returns:
GroupNumber = BIGINT; a row number beginning with 1 and ending with #rows
Members = BIGINT; Number of possible distinct members in the group
Low = BIGINT; the lower-bound range
High = BIGINT; the Upper-bound range
[Dependencies]:
core.rangeAB (iTVF)
[Developer Notes]:
1. An inline derived tally table using a CTE or subquery WILL NOT WORK. NTally requires
a correctly indexed tally table named dbo.tally; if you have or choose to use a
permanent tally table with a different name or in a different schema make sure to
change the DDL for this function accordingly. The recomended number of rows is
1,000,000; below is the recomended DDL for dbo.tally. Note the "Beginning" and "End"
of tally code.To learn more about tally tables see:
http://www.sqlservercentral.com/articles/T-SQL/62867/
2. For best results a P.O.C. index should exists on the table that you are "tiling". For
more information about P.O.C. indexes see:
http://sqlmag.com/sql-server-2012/sql-server-2012-how-write-t-sql-window-functions-part-3
3. NGroupRangeAB is deterministic; for more about deterministic and nondeterministic functions
see https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
-----------------------------------------------------------------------------------------
--===== 1. Basic illustration of the relationship between core.NGroupRangeAB and NTILE.
-- Consider this query which assigns 3 "tile groups" to 10 rows:
DECLARE #rows BIGINT = 7, #tiles BIGINT = 3;
SELECT t.N, t.TileGroup
FROM ( SELECT r.RN, NTILE(#tiles) OVER (ORDER BY r.RN)
FROM core.rangeAB(1,#rows,1,1) AS r) AS t(N,TileGroup);
Results:
N TileGroup
--- ----------
1 1
2 1
3 1
4 2
5 2
6 3
7 3
To pivot these "equiheight histograms" into "equiwidth histograms" we could do this:
DECLARE #rows BIGINT = 7, #tiles BIGINT = 3;
SELECT TileGroup = t.TileGroup,
[Low] = MIN(t.N),
[High] = MAX(t.N),
Members = COUNT(*)
FROM ( SELECT r.RN, NTILE(#tiles) OVER (ORDER BY r.RN)
FROM core.rangeAB(1,#rows,1,1) AS r) AS t(N,TileGroup);
GROUP BY t.TileGroup;
Results:
TileGroup Low High Members
---------- ---- ----- -----------
1 1 3 3
2 4 5 2
3 6 7 2
This will return the same thing at a tiny fraction of the cost:
SELECT TileGroup = ng.GroupNumber,
[Low] = ng.[Low],
[High] = ng.[High],
Members = ng.Members
FROM core.NGroupRangeAB(1,#rows,#tiles) AS ng;
--===== 2.1. Divide 25 Rows into 3 groups
DECLARE #min BIGINT = 1, #max BIGINT = 25, #groups BIGINT = 4;
SELECT ng.GroupNumber, ng.Members, ng.low, ng.high
FROM core.NGroupRangeAB(#min,#max,#groups) AS ng;
--===== 2.2. Assign group membership to another table
DECLARE #min BIGINT = 1, #max BIGINT = 25, #groups BIGINT = 4;
SELECT
ng.GroupNumber, ng.low, ng.high, s.WidgetId, s.Price
FROM (VALUES('a',$12),('b',$22),('c',$9),('d',$2)) AS s(WidgetId,Price)
JOIN core.NGroupRangeAB(#min,#max,#groups) AS ng
ON s.Price BETWEEN ng.[Low] AND ng.[High]
ORDER BY ng.RN;
Results:
GroupNumber low high WidgetId Price
------------ ---- ----- --------- ---------------------
1 1 7 d 2.00
2 8 13 a 12.00
2 8 13 c 9.00
4 20 25 b 22.00
-----------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20190128 - Initial Creation; Final Tuning - Alan Burstein
****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT
RN = r.RN, -- Sort Key
GroupNumber = r.N2, -- Bucket (group) number
Members = g.S-ur.N+1, -- Count of members in this group
[Low] = r.RN*g.S+rc.N+ur.N, -- Lower boundary for the group (inclusive)
[High] = r.N2*g.S+rc.N -- Upper boundary for the group (inclusive)
FROM core.rangeAB(0,#groups-1,1,0) AS r -- Range Function
CROSS APPLY (VALUES((#max-#min)/#groups,(#max-#min)%#groups)) AS g(S,U) -- Size, Underflow
CROSS APPLY (VALUES(SIGN(SIGN(r.RN-g.U)-1)+1)) AS ur(N) -- get Underflow
CROSS APPLY (VALUES(#min+r.RN-(ur.N*(r.RN-g.U)))) AS rc(N); -- Running Count
GO

How to Group Items in a Count statement

I'm trying to create a query that will return Total Claims reported in 0-3 days, 4-7 days, 8-14 days, and 15+
Select DATEDiff(DD,LossDate,DateReported) As TimeToReport,Count(ClaimId) As Num from LossRun
where PolicyNum='1234567890'
And PolTerm='201403'
Group By DATEDiff(DD,LossDate,DateReported)
order by DATEDiff(DD,LossDate,DateReported);
This is what i get
TimeToReport NumofClaims
0 5
1 3
2 1
3 4
4 3
5 2
6 2
7 2
8 1
12 1
13 1
14 2
15 2
48 1
52 1
107 1
121 1
147 1
533 1
Basically i want to see the total for 0-3, 4-7,8-14,and the rest,,,, timeToReport
You can try to use SUM with CASW WHEN
select
SUM(CASW WHEN TimeToReport <= 3 THEN NumofClaims ELSE 0 END) '0~3 day',
SUM(CASW WHEN TimeToReport >= 4 AND TimeToReport <=7 THEN NumofClaims END) '4-7 days',
SUM(CASW WHEN TimeToReport >= 8 AND TimeToReport <=14 THEN NumofClaims ELSE 0 END) '8-14 days',
SUM(CASW WHEN TimeToReport >= 15 THEN NumofClaims ELSE 0 END) '15+ day'
from (
Select DATEDiff(DD,LossDate,DateReported) As TimeToReport,Count(ClaimId) As Num
from LossRun
where PolicyNum='1234567890'
And PolTerm='201403'
Group By DATEDiff(DD,LossDate,DateReported)
) t
The most simple way is going to be by creating your own temp table which includes the min and max for each bucket and then joining to it.
declare #t table (OrderedID int, EmpID int, EffDate date, Salary money)
insert into #t
values
(1,1234,'20150101',1)
,(2,1234,'20160101',2)
,(3,1234,'20170101',8)
,(4,1234,'20180101',15)
,(1,2351,'20150101',17)
,(5,1234,'20190101',4)
,(5,1234,'20190101',2)
,(5,1234,'20190101',9)
declare #Bin table (MinVal int, MaxVal int)
insert into #Bin
values
(1,3)
,(4,6)
,(7,9)
,(10,15)
,(15,20)
,(20,25)
Select
B.MinVal,count(T.EmpID) as EmpsInBin
From #t T
inner join #Bin B on T.Salary between B.MinVal and B.MaxVal
group by B.MinVal
Output
MinVal EmpsInBin
1 3
4 1
7 2
10 1
15 2

SQL Server Rank Function without taking into account some flagged values

Here is my problem: I have a list of flagged values, I want to see where those values would be in the case they weren't flagged. But I don't want the other flagged values to influence the order.
Note: Flagged values are the ones with CurrentPlace 10000
ID Value CurrentPlace
------------------------
1 2 1
2 8 3
3 3 2
4 4 10000
5 5 10000
6 10 10000
Using:
select *
from
(select
id, value,
rank() over (order by Value asc) as Rank
from
tbl1) r
where
r.ID in (select id from tbl1 where CurrentPlace = 10000)
Desired output:
ID Value Rank
------------------
4 4 3
5 5 3
6 10 4
But I'm getting this instead:
ID Value Rank
------------------
4 4 3
5 5 4
6 10 6
Any help will be appreciated
Thank you guys
I've solved with
SELECT ID, Value, Rank
FROM tbl1 a
CROSS APPLY
(SELECT isnull(max(currentPlace),0) + 1 AS Rank FROM tbl1 WHERE value < a.value and currentPlace <> 10000) b
WHERE a.CurrentPlace = 10000
Please feel free to comment this out.

how to acomplish a full series in sql

I want to achieve a full numeric scale from 0 to the max number in the table.
Let's say we have a table T with two fields named x and y
select x,y
from t
would show us lets say the results
X Y
3 11
5 23
7 45
9 1
10 34
I found this query to build sequential numbers:
With T_Misparim As
(Select 1 N
Union All
Select N+1 N
From T_Misparim
Where N<1000)
Select N
From T_Misparim
Option (MaxRecursion 0);
from this source : http://www.sqlserver.co.il/?p=3296
My bottom line is, how do i integrate the two queries into a single query to give
right outer join :
N X Y
0 null 0
1 null 0
2 null 0
3 3 11
4 null 0
5 5 23
6 null 0
7 7 45
8 null 0
9 9 1
10 10 34
You can just LEFT JOIN with the ordinal number CTE;
select 3 as X, 11 as Y into #TEST
insert #TEST values (5,23),(7,45),(9,1),(10,34)
;with NUMS(n) as (
select 0 union all
select 1 + n from NUMS where n < 50
)
select
NUMS.n N,
T.X,
isnull(T.Y, 0) Y
from NUMS
left join #TEST T on (T.X = NUMS.n)
option (maxrecursion 50)
For
N X Y
0 NULL 0
1 NULL 0
2 NULL 0
3 3 11
4 NULL 0
5 5 23
6 NULL 0
7 7 45
8 NULL 0
9 9 1
10 10 34

Resources