SQL Server - Behaviour of ROW_NUMBER Partition by Null Value - sql-server

I find this behaviour very strange and counterintuitive (even for SQL).
set ansi_nulls off
go
;with sampledata(Value, CanBeNull) as
(
select 1, 1
union
select 2, 2
union
select 3, null
union
select 4, null
union
select 5, null
union
select 6, null
)
select ROW_NUMBER() over(partition by CanBeNull order by value) 'RowNumber',* from sampledata
Which returns
1 3 NULL
2 4 NULL
3 5 NULL
4 6 NULL
1 1 1
1 2 2
Which means that all of the nulls are being treated as part of the same group for the purpose of calculating the row number. It doesn't matter whether SET ANSI_NULLS is on or off.
But since by definition the null is totally unknown then how can the nulls be grouped together like this? It is saying that for the purposes of placing things in a rank order that apples and oranges and the square root of minus 1 and quantum black holes or whatever can be meaningfully ordered. A little experimentation suggests that the first column is being used to generate the rank order as
select 1, '1'
union
select 2, '2'
union
select 5, null
union
select 6, null
union
select 3, null
union
select 4, null
generates the same values. This has significant implications which have caused problems in legacy code I am dealing with. Is this the expected behaviour and is there any way of mitigating it other than replacing the null in the select query with a unique value?
The results I would have expected would have been
1 3 NULL
1 4 NULL
1 5 NULL
1 6 NULL
1 1 1
1 2 2
Using Dense_Rank() makes no difference.

Yo.
So the deal is that when T-SQL is dealing with NULLs in predicates, it uses ternary logic (TRUE, FALSE or UNKNOWN) and displays the behavior that you have stated that you expect from your query. However, when it comes to grouping values, T-SQL treats NULLs as one group. So your query will group the NULLs together and start numbering the rows within that window.
For the results that you say you would like to see, this query should work...
WITH sampledata (Value, CanBeNull)
AS
(
SELECT 1, 1
UNION
SELECT 2, 2
UNION
SELECT 3, NULL
UNION
SELECT 4, NULL
UNION
SELECT 5, NULL
UNION
SELECT 6, NULL
)
SELECT
DENSE_RANK() OVER (PARTITION BY CanBeNull ORDER BY CASE WHEN CanBeNull IS NOT NULL THEN value END ASC) as RowNumber
,Value
,CanBeNull
FROM sampledata
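To see the difference between the two NULL treatments side by side, here is a minimal sketch that runs the same sample data through SQLite via Python's sqlite3 module. This is an assumption-laden stand-in, not SQL Server: it needs SQLite 3.25 or later for window functions, SQLite has no ANSI_NULLS setting (its comparison behaves like ANSI_NULLS ON), and the helper name run is made up for illustration. The partitioning behaviour is the part that carries over.

```python
import sqlite3

# Same sample rows as the question: (Value, CanBeNull)
ROWS = [(1, 1), (2, 2), (3, None), (4, None), (5, None), (6, None)]


def run(sql):
    """Load the sample rows into an in-memory table and run one query."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE sampledata (Value INTEGER, CanBeNull INTEGER)")
        con.executemany("INSERT INTO sampledata VALUES (?, ?)", ROWS)
        return con.execute(sql).fetchall()
    finally:
        con.close()


# Predicate: NULL = NULL evaluates to UNKNOWN, so no row passes the filter.
predicate_matches = run("SELECT Value FROM sampledata WHERE CanBeNull = NULL")

# PARTITION BY: all NULLs share one partition and get numbered 1..4.
partitioned = run("""
    SELECT ROW_NUMBER() OVER (PARTITION BY CanBeNull ORDER BY Value) AS rn,
           Value, CanBeNull
    FROM sampledata
    ORDER BY Value
""")

# Workaround from the answer: inside the NULL partition every row sorts on
# the same NULL key, so DENSE_RANK treats them as ties and returns 1 for all.
workaround = run("""
    SELECT DENSE_RANK() OVER (
               PARTITION BY CanBeNull
               ORDER BY CASE WHEN CanBeNull IS NOT NULL THEN Value END
           ) AS rn,
           Value, CanBeNull
    FROM sampledata
    ORDER BY Value
""")

if __name__ == "__main__":
    print(predicate_matches)
    print(partitioned)
    print(workaround)
```

Note that the workaround relies on ties in the ORDER BY key, so it only gives every NULL row a 1 with a ranking function such as DENSE_RANK or RANK; ROW_NUMBER would still count 1 to 4.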

Related

Filter table to show only most recent values [duplicate]

I have a table that looks like this.
Category  Type  fromDate  Value
-------------------------------
1         1     1/1/2022  5
1         2     1/1/2022  10
2         1     1/1/2022  7.5
2         2     1/1/2022  15
3         1     1/1/2022  3.5
3         2     1/1/2022  5
3         1     4/1/2022  5
3         2     4/1/2022  10
I'm trying to filter this table down to keep only the most recent row for each Category/Type grouping, i.e. rows 5 and 6 would be removed by the query since they are older records.
So far I have the below query but I am getting an aggregate error due to not aggregating the "Value" column. My question is how do I get around this without aggregating? I want to keep the actual value that is in the column.
SELECT T1.Category, T1.Type, T2.maxDate, T1.Value
FROM (SELECT Category, Type, MAX(fromDate) AS maxDate
FROM Table GROUP BY Category,Type) T2
INNER JOIN Table T1 ON T1.Category=T2.Category
GROUP BY T1.Category, T1.Type, T2.MaxDate
This has been asked and answered dozens and dozens of times. But it was quick and painless to type up an answer. This should work for you.
-- table variable holding the sample data from the question
declare @MyTable table
(
Category int
, Type int
, fromDate date
, Value decimal(5,2)
)
insert @MyTable
select 1, 1, '1/1/2022', 5 union all
select 1, 2, '1/1/2022', 10 union all
select 2, 1, '1/1/2022', 7.5 union all
select 2, 2, '1/1/2022', 15 union all
select 3, 1, '1/1/2022', 3.5 union all
select 3, 2, '1/1/2022', 5 union all
select 3, 1, '4/1/2022', 5 union all
select 3, 2, '4/1/2022', 10
-- number the rows within each Category/Type, newest first, and keep row 1
select Category
, Type
, fromDate
, Value
from
(
select *
, RowNum = ROW_NUMBER() over(partition by Category, Type order by fromDate desc)
from @MyTable
) x
where x.RowNum = 1
order by x.Category
, x.Type
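If you want to try the ROW_NUMBER pattern without a SQL Server instance, here is a rough equivalent using Python's sqlite3 module. Treat it as a sketch under assumptions: it needs SQLite 3.25 or later, the dates are rewritten as ISO strings so they sort correctly as text (reading 4/1/2022 as 1 April), and the function name latest_per_group is made up for illustration.

```python
import sqlite3

# Sample rows: (Category, Type, fromDate, Value), dates as ISO text
ROWS = [
    (1, 1, "2022-01-01", 5.0),
    (1, 2, "2022-01-01", 10.0),
    (2, 1, "2022-01-01", 7.5),
    (2, 2, "2022-01-01", 15.0),
    (3, 1, "2022-01-01", 3.5),
    (3, 2, "2022-01-01", 5.0),
    (3, 1, "2022-04-01", 5.0),
    (3, 2, "2022-04-01", 10.0),
]


def latest_per_group(rows):
    """Return the most recent row for every (Category, Type) pair."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE MyTable "
            "(Category INTEGER, Type INTEGER, fromDate TEXT, Value REAL)"
        )
        con.executemany("INSERT INTO MyTable VALUES (?, ?, ?, ?)", rows)
        return con.execute("""
            SELECT Category, Type, fromDate, Value
            FROM (
                SELECT *,
                       ROW_NUMBER() OVER (
                           PARTITION BY Category, Type
                           ORDER BY fromDate DESC
                       ) AS RowNum
                FROM MyTable
            ) AS x
            WHERE x.RowNum = 1          -- newest row in each group
            ORDER BY x.Category, x.Type
        """).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in latest_per_group(ROWS):
        print(row)
```

Because the filter is on the row number rather than on an aggregate, the Value column comes through untouched, which is what the question asked for.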

T-SQL - Deduplicate large table

Sorry if this has already been asked. I see a lot of similar questions but none exactly like this one. I am trying to de-dup a large set of records (about 500 million):
Sample data:
CUST_ID PROD_TYPE VALUE DATE
------------------------------------
1 1 Y 5/1/2015 *
1 2 N 5/1/2015 *
1 1 N 5/2/2015 *
1 2 N 5/2/2015
1 1 Y 5/3/2015 *
1 2 Y 5/3/2015 *
1 1 Y 5/6/2015
1 2 N 5/6/2015 *
By CUST_ID and PROD_TYPE, I need to retain the initial records as well as any records having a changed VALUE (the records with the asterisks). There can sometimes be gaps between the dates. There are around 5M unique CUST_ID's.
Any help would be greatly appreciated.
Not sure why LAG isn't working for you, this returns your results:
with t as (
select 1 as CUST_ID, 1 as PROD_TYPE, 'Y' as VALUE, '5/1/2015' as [Date]
union
select 1, 2, 'N', '5/1/2015'
union
select 1, 1, 'N', '5/2/2015'
union
select 1, 2, 'N', '5/2/2015'
union
select 1,1, 'Y', '5/3/2015'
union
select 1, 2, 'Y','5/3/2015'
union
select 1,1, 'Y', '5/6/2015'
union
select 1, 2,'N','5/6/2015')
select
*,
case when
value <>
isnull(lag(value) over (partition by cust_id, prod_type order by [date]),'')
then 1 else 0
end as keep
from
t
order by
[date],
cust_id,
prod_type
Thanks Kyle, that is exactly correct, and I was able to use that as a solution to my problem. The issue I was having (not being familiar with lag) was that I had failed to provide a default, so the gap in dates was creating a NULL value which was giving me problems, but once I provided that, it worked like a charm. Thanks!
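For anyone else tripped up by the missing default, here is a small sketch of the same idea using Python's sqlite3 module, with the default passed directly as the third argument of LAG instead of wrapping it in ISNULL. Assumptions: SQLite 3.25 or later stands in for SQL Server, the dates are rewritten as ISO text, the column is called dt, and the function name changed_rows is made up for illustration.

```python
import sqlite3

# Sample rows: (cust_id, prod_type, value, dt), dates as ISO text
ROWS = [
    (1, 1, "Y", "2015-05-01"),
    (1, 2, "N", "2015-05-01"),
    (1, 1, "N", "2015-05-02"),
    (1, 2, "N", "2015-05-02"),
    (1, 1, "Y", "2015-05-03"),
    (1, 2, "Y", "2015-05-03"),
    (1, 1, "Y", "2015-05-06"),
    (1, 2, "N", "2015-05-06"),
]


def changed_rows(rows):
    """Keep the first row per (cust_id, prod_type) and every change of value."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE t (cust_id INTEGER, prod_type INTEGER, value TEXT, dt TEXT)"
        )
        con.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)
        return con.execute("""
            SELECT cust_id, prod_type, value, dt
            FROM (
                SELECT *,
                       -- third argument is the default used when there is
                       -- no previous row, so the first row always differs
                       LAG(value, 1, '') OVER (
                           PARTITION BY cust_id, prod_type ORDER BY dt
                       ) AS prev_value
                FROM t
            ) AS x
            WHERE value <> prev_value
            ORDER BY dt, cust_id, prod_type
        """).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in changed_rows(ROWS):
        print(row)
```

Without the default, the first row of each partition compares against NULL, the comparison is UNKNOWN, and that row silently drops out of the result.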

Search within ColA duplicates against specific unique vals in ColB to exclude all of ColA

I apologize in advance; I feel like I'm missing something really stupidly simple (and let's ignore database structure, as I'm kind of locked into that).
I have, let's use customer orders - an order number can be shipped to more than one place. For the sake of ease I'm just illustrating three but it could be more than that (home, office, gift, gift2, gift 3, etc)
So my table is:
Customer orders:
OrderID MailingID
--------------------
1 1
1 2
1 3
2 1
3 1
3 3
4 1
4 2
4 3
What I need to find is OrderIDs that have been shipped to MailingID 1 but not 2 (basically what I need to find is orderID 2 and 3 above).
If it matters, I'm using Sql Express 2012.
Thanks
Maybe this could help:
create table #temp(
orderID int,
mailingID int
)
insert into #temp
select 1, 1 union all
select 1, 2 union all
select 1, 3 union all
select 2, 1 union all
select 3, 1 union all
select 3, 3 union all
select 4, 1 union all
select 4, 2 union all
select 4, 3
-- find orderIDs that have been shipped to mailingID = 1
select
distinct orderID
from #temp
where mailingID = 1
except
-- find orderIDs that have been shipped to mailingID = 2
select
orderID
from #temp
where mailingID = 2
drop table #temp
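A simple subquery with the NOT IN operator should work. The outer filter on mailingID = 1 makes sure the order was actually shipped there.
SELECT DISTINCT a.OrderID
FROM <tablename> a
WHERE a.mailingID = 1
AND a.orderid NOT IN (SELECT b.orderid
FROM <tablename> b
WHERE b.mailingID = 2)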
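Both approaches can be checked quickly against the sample data. The sketch below uses Python's sqlite3 module rather than SQL Server, and swaps NOT IN for NOT EXISTS, which is a variation of mine and not from the answers above: NOT IN returns nothing if the subquery ever yields a NULL, while NOT EXISTS does not have that trap. The table name CustomerOrders and the function names are made up for illustration.

```python
import sqlite3

# Sample rows: (OrderID, MailingID)
ROWS = [(1, 1), (1, 2), (1, 3), (2, 1), (3, 1), (3, 3), (4, 1), (4, 2), (4, 3)]


def _query(sql):
    """Load the sample rows and return the matching OrderIDs, sorted."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE CustomerOrders (OrderID INTEGER, MailingID INTEGER)")
        con.executemany("INSERT INTO CustomerOrders VALUES (?, ?)", ROWS)
        return sorted(r[0] for r in con.execute(sql))
    finally:
        con.close()


def with_except():
    # Set difference: orders shipped to 1, minus orders shipped to 2.
    return _query("""
        SELECT OrderID FROM CustomerOrders WHERE MailingID = 1
        EXCEPT
        SELECT OrderID FROM CustomerOrders WHERE MailingID = 2
    """)


def with_not_exists():
    # Anti-join: shipped to 1, and no row for the same order with MailingID 2.
    return _query("""
        SELECT DISTINCT a.OrderID
        FROM CustomerOrders AS a
        WHERE a.MailingID = 1
          AND NOT EXISTS (
              SELECT 1
              FROM CustomerOrders AS b
              WHERE b.OrderID = a.OrderID
                AND b.MailingID = 2
          )
    """)


if __name__ == "__main__":
    print(with_except())
    print(with_not_exists())
```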

Sql Server Rank on Value Range

I have a table with three columns, ID, Date, Value. I want to rank the rows such that, within an ID, the Ranking goes up with each date where Value is at least X, otherwise, Ranking stays the same.
Given ID, Date, and Values like these
1, 6/1, 8
1, 6/2, 12
1, 6/3, 14
1, 6/4, 9
1, 6/5, 11
I would like to return a ranking based on values of at least 10, such that I would have ID, Date, Value, and Rank like this:
1, 6/1, 8, 0
1, 6/2, 12, 1
1, 6/3, 14, 2
1, 6/4, 9, 2
1, 6/5, 11, 3
In other words, the ranking increases each time the value meets or exceeds the threshold; otherwise it stays the same.
What I have tried is
SELECT T1.*, X.Ranking FROM TABLE T1
LEFT JOIN ( SELECT *, DENSE_RANK( ) OVER ( PARTITION BY T2.ID ORDER BY T2.DATE ) Ranking
FROM TABLE T2 WHERE T2.VALUE >= 10 ) X
ON T1.ID = X.ID AND T1.Date = X.Date
This almost works. It gets me output like
1, 6/1, 8, NULL
1, 6/2, 12, 1
1, 6/3, 14, 2
1, 6/4, 9, NULL
1, 6/5, 11, 3
Then, I want to turn the first NULL into a 0, and the second into a 2.
I turned the above query into a cte and tried
SELECT T1.*, CASE WHEN T1.Ranking IS NULL THEN ISNULL( (
SELECT MAX( T2.Ranking )
FROM cte T2 WHERE T1.ID = T2.ID AND T1.Date > T2.Date ), 0 )
ELSE T1.Ranking END NewRanking
FROM cte T1
This looks like it would work, but my table has 200,000 rows and the query ran for 25 minutes... So, I'm looking for something a little more out of the box than the SELECT MAX.
You are using SQL Server 2012, so you can do a cumulative sum:
select t.*,
sum(case when value >= 10 then 1 else 0 end) over
(partition by id order by date) as ranking
from table t;
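Here is a quick way to check the cumulative sum against the sample rows from the question, sketched with Python's sqlite3 module instead of SQL Server (SQLite 3.25 or later assumed). The table name readings, the column name dt, the month-day date strings and the function name rank_by_threshold are all made up for illustration.

```python
import sqlite3

# Sample rows: (id, dt, value); dates kept as sortable month-day text
ROWS = [
    (1, "06-01", 8),
    (1, "06-02", 12),
    (1, "06-03", 14),
    (1, "06-04", 9),
    (1, "06-05", 11),
]


def rank_by_threshold(rows, threshold=10):
    """Running count, per id, of rows whose value is at least `threshold`."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE readings (id INTEGER, dt TEXT, value INTEGER)")
        con.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        return con.execute("""
            SELECT id, dt, value,
                   -- each qualifying row adds 1 to the running total;
                   -- non-qualifying rows add 0 and repeat the previous rank
                   SUM(CASE WHEN value >= ? THEN 1 ELSE 0 END) OVER (
                       PARTITION BY id ORDER BY dt
                   ) AS ranking
            FROM readings
            ORDER BY id, dt
        """, (threshold,)).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in rank_by_threshold(ROWS):
        print(row)
```

This is a single pass over each partition, so it avoids the per-row correlated MAX subquery that made the original attempt slow.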
EDIT: This actually does not work. In spirit, it fetches the previous row's my_rank with LAG and increments it, but that is not how LAG works: the definition would be recursive in essence, which results in a "'my_rank' is undefined" error. The better solution is the accepted answer based on a cumulative sum.
If you have SQL Server 2012 (you didn't tag your question), you can do something like:
SELECT
LAG(my_rank, 1, 0) OVER (ORDER BY DATE)
+ CASE WHEN VALUE >= 10 THEN 1 ELSE 0 END AS my_rank
FROM T1

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the date of the first occurrence. The exception here is that I need to group them only if they are not interrupted by a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I am sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with - probably could work with that. This is a query that should locate when a value is changed.
SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
d1.Value <> d2.Value
It probably could be done with good use of row_number, but I can't manage it.
Sample data:
declare @T table (ID int, Value int, Timestamp date)
insert into @T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from @T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second CTE, Firsts, simply finds rows where there isn't an immediately preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
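The same row-number anti-join can be tried out with Python's sqlite3 module if you don't have SQL Server handy. This is only a sketch under assumptions: SQLite 3.25 or later, the Timestamp column renamed to Ts and stored as ISO text, and a made-up function name first_of_each_run.

```python
import sqlite3

# Sample rows: (ID, Value, Ts), dates as ISO text
ROWS = [
    (1, 0, "2001-01-01"),
    (2, 0, "2002-01-01"),
    (3, 1, "2003-01-01"),
    (4, 1, "2004-01-01"),
    (5, 0, "2005-01-01"),
    (6, 2, "2006-01-01"),
    (7, 2, "2007-01-01"),
    (8, 2, "2008-01-01"),
]


def first_of_each_run(rows):
    """Return the first row of every uninterrupted run of equal Values."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE T (ID INTEGER, Value INTEGER, Ts TEXT)")
        con.executemany("INSERT INTO T VALUES (?, ?, ?)", rows)
        return con.execute("""
            WITH OrderedValues AS (
                SELECT ID, Value, Ts,
                       ROW_NUMBER() OVER (ORDER BY Ts) AS rn
                FROM T
            )
            SELECT ov1.ID, ov1.Value, ov1.Ts
            FROM OrderedValues AS ov1
            LEFT JOIN OrderedValues AS ov2
                   ON ov1.Value = ov2.Value
                  AND ov1.rn = ov2.rn + 1   -- ov2 is the row just before ov1
            WHERE ov2.ID IS NULL            -- no same-valued predecessor
            ORDER BY ov1.rn
        """).fetchall()
    finally:
        con.close()


if __name__ == "__main__":
    for row in first_of_each_run(ROWS):
        print(row)
```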
