T-SQL - Deduplicate large table - sql-server

Sorry if this has already been asked. I see a lot of similar questions but none exactly like this one. I am trying to de-dup a large set of records (about 500M):
Sample data:
CUST_ID  PROD_TYPE  VALUE  DATE
------------------------------------
1        1          Y      5/1/2015 *
1        2          N      5/1/2015 *
1        1          N      5/2/2015 *
1        2          N      5/2/2015
1        1          Y      5/3/2015 *
1        2          Y      5/3/2015 *
1        1          Y      5/6/2015
1        2          N      5/6/2015 *
By CUST_ID and PROD_TYPE, I need to retain the initial records as well as any records having a changed VALUE (the records with the asterisks). There can sometimes be gaps between the dates. There are around 5M unique CUST_ID's.
Any help would be greatly appreciated.

Not sure why LAG isn't working for you; this returns your results:
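with t as (
    -- Sample rows. Typing the first [Date] as date makes the whole column a date,
    -- so ORDER BY sorts chronologically rather than as text.
    select 1 as CUST_ID, 1 as PROD_TYPE, 'Y' as VALUE, cast('20150501' as date) as [Date]
    union all select 1, 2, 'N', '20150501'
    union all select 1, 1, 'N', '20150502'
    union all select 1, 2, 'N', '20150502'
    union all select 1, 1, 'Y', '20150503'
    union all select 1, 2, 'Y', '20150503'
    union all select 1, 1, 'Y', '20150506'
    union all select 1, 2, 'N', '20150506'
)
select
    *,
    -- keep = 1 for the first row of each (cust_id, prod_type) and for every row whose value changed;
    -- isnull supplies a default for the first row, where lag returns NULL
    case when value <> isnull(lag(value) over (partition by cust_id, prod_type order by [date]), '')
         then 1 else 0
    end as keep
from
    t
order by
    [date],
    cust_id,
    prod_type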

Thanks Kyle, that is exactly correct, and I was able to use that as a solution to my problem. The issue I was having (not being familiar with lag) was that I had failed to provide a default, so the gap in dates was creating a NULL value which was giving me problems, but once I provided that, it worked like a charm. Thanks!
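The query above only flags rows. As a minimal sketch of the actual de-duplication step, the same LAG comparison can drive a DELETE through a CTE. The temp table #cust_prod is a hypothetical stand-in for the real table, and the sketch assumes SQL Server 2012 or later and a VALUE column that is never NULL; on 500M rows you would probably want to test this on a copy first.

```sql
-- Hypothetical stand-in for the real table, loaded with the sample rows from the question
CREATE TABLE #cust_prod (CUST_ID int, PROD_TYPE int, VALUE char(1), [DATE] date);

INSERT INTO #cust_prod (CUST_ID, PROD_TYPE, VALUE, [DATE]) VALUES
    (1, 1, 'Y', '20150501'), (1, 2, 'N', '20150501'),
    (1, 1, 'N', '20150502'), (1, 2, 'N', '20150502'),
    (1, 1, 'Y', '20150503'), (1, 2, 'Y', '20150503'),
    (1, 1, 'Y', '20150506'), (1, 2, 'N', '20150506');

-- Pair each row with the previous VALUE for the same customer and product
WITH flagged AS (
    SELECT VALUE,
           LAG(VALUE) OVER (PARTITION BY CUST_ID, PROD_TYPE ORDER BY [DATE]) AS prev_value
    FROM #cust_prod
)
-- Delete rows that repeat the previous VALUE. For the first row of each group
-- prev_value is NULL, the comparison is not true, and the row is kept.
DELETE FROM flagged
WHERE prev_value = VALUE;

-- Rows that remain after the delete
SELECT CUST_ID, PROD_TYPE, VALUE, [DATE]
FROM #cust_prod
ORDER BY [DATE], CUST_ID, PROD_TYPE;

DROP TABLE #cust_prod;
```

With the sample data this should leave the six rows marked with asterisks in the question.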

Related

Unexpected analytic function output in common table expression

In SQL Server 2019, analytic functions are not returning the results that I would expect in the context of recursive common table expressions. Consider the following non-recursive T-SQL query:
WITH SourceData (RowNum, Uniform, RowVal) AS (
SELECT 1, 'A', 'A' UNION ALL
SELECT 2, 'A', 'B' UNION ALL
SELECT 3, 'A', 'C' UNION ALL
SELECT 4, 'A', 'D'
),
RecursiveCte0 (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT RowNum, Uniform, RowVal, RowVal, RowNum, CAST(RowNum AS BIGINT), 0
FROM SourceData
),
RecursiveCte1 (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT * FROM RecursiveCte0
UNION ALL
SELECT
RowNum, Uniform, RowVal,
MIN(MinVal) OVER (PARTITION BY Uniform),
SUM(RowNum) OVER (PARTITION BY Uniform),
ROW_NUMBER() OVER (PARTITION BY Uniform ORDER BY RowNum),
RecursiveLevel + 1
FROM RecursiveCte0
)
SELECT *
FROM RecursiveCte1
ORDER BY RecursiveLevel, RowNum;
Results:
RowNum  Uniform  RowVal  MinVal  SomeSum  RowNumCalc  RecursiveLevel
1       A        A       A       1        1           0
2       A        B       B       2        2           0
3       A        C       C       3        3           0
4       A        D       D       4        4           0
1       A        A       A       10       1           1
2       A        B       A       10       2           1
3       A        C       A       10       3           1
4       A        D       A       10       4           1
As expected, the MIN, SUM, and ROW_NUMBER functions generate the appropriate values based on all rows from RecursiveCte0. I would expect the following recursive query to be logically identical to the non-recursive version above, but it produces different results:
WITH SourceData (RowNum, Uniform, RowVal) AS (
SELECT 1, 'A', 'A' UNION ALL
SELECT 2, 'A', 'B' UNION ALL
SELECT 3, 'A', 'C' UNION ALL
SELECT 4, 'A', 'D'
),
RecursiveCte (RowNum, Uniform, RowVal, MinVal, SomeSum, RowNumCalc, RecursiveLevel) AS (
SELECT RowNum, Uniform, RowVal, RowVal, RowNum, CAST(RowNum AS BIGINT), 0
FROM SourceData
UNION ALL
SELECT
RowNum, Uniform, RowVal,
MIN(MinVal) OVER (PARTITION BY Uniform),
SUM(RowNum) OVER (PARTITION BY Uniform),
ROW_NUMBER() OVER (PARTITION BY Uniform ORDER BY RowNum),
RecursiveLevel + 1
FROM RecursiveCte
WHERE RecursiveLevel < 1
)
SELECT *
FROM RecursiveCte
ORDER BY RecursiveLevel, RowNum;
Results:
RowNum  Uniform  RowVal  MinVal  SomeSum  RowNumCalc  RecursiveLevel
1       A        A       A       1        1           0
2       A        B       B       2        2           0
3       A        C       C       3        3           0
4       A        D       D       4        4           0
1       A        A       A       1        1           1
2       A        B       B       2        1           1
3       A        C       C       3        1           1
4       A        D       D       4        1           1
For each of the three analytic functions, it appears that the grouping is only being applied within the context of each individual row, rather than all of the rows at that level. This unexpected behavior also happens if I partition over (SELECT NULL). I would expect the analytic functions to apply to the entire recursion level, as per MSDN:
Analytic and aggregate functions in the recursive part of the CTE are
applied to the set for the current recursion level and not to the set
for the CTE. Functions like ROW_NUMBER operate only on the subset of
data passed to them by the current recursion level and not the entire
set of data passed to the recursive part of the CTE.
Why do these two queries produce different results? Is there a way to effectively use analytic functions with recursive common table expressions?

SQL Server - Behaviour of ROW_NUMBER Partition by Null Value

I find this behaviour very strange and counterintuitive. (Even for SQL).
set ansi_nulls off
go
;with sampledata(Value, CanBeNull) as
(
select 1, 1
union
select 2, 2
union
select 3, null
union
select 4, null
union
select 5, null
union
select 6, null
)
select ROW_NUMBER() over(partition by CanBeNull order by value) 'RowNumber',* from sampledata
Which returns
1 3 NULL
2 4 NULL
3 5 NULL
4 6 NULL
1 1 1
1 2 2
Which means that all of the nulls are being treated as part of the same group for the purpose of calculating the row number. It doesn't matter whether SET ANSI_NULLS is on or off.
But since by definition the null is totally unknown then how can the nulls be grouped together like this? It is saying that for the purposes of placing things in a rank order that apples and oranges and the square root of minus 1 and quantum black holes or whatever can be meaningfully ordered. A little experimentation suggests that the first column is being used to generate the rank order as
select 1, '1'
union
select 2, '2'
union
select 5, null
union
select 6, null
union
select 3, null
union
select 4, null
generates the same values. This has significant implications which have caused problems in legacy code I am dealing with. Is this the expected behaviour and is there any way of mitigating it other than replacing the null in the select query with a unique value?
The results I would have expected would have been
1 3 NULL
1 4 NULL
1 5 NULL
1 6 NULL
1 1 1
1 2 2
Using Dense_Rank() makes no difference.
Yo.
So the deal is that when T-SQL is dealing with NULLs in predicates, it uses ternary logic (TRUE, FALSE or UNKNOWN) and displays the behavior that you have stated that you expect from your query. However, when it comes to grouping values, T-SQL treats NULLs as one group. So your query will group the NULLs together and start numbering the rows within that window.
For the results that you say you would like to see, this query should work...
WITH sampledata (Value, CanBeNull)
AS
(
SELECT 1, 1
UNION
SELECT 2, 2
UNION
SELECT 3, NULL
UNION
SELECT 4, NULL
UNION
SELECT 5, NULL
UNION
SELECT 6, NULL
)
SELECT
DENSE_RANK() OVER (PARTITION BY CanBeNull ORDER BY CASE WHEN CanBeNull IS NOT NULL THEN value END ASC) as RowNumber
,Value
,CanBeNull
FROM sampledata
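The difference the answer describes between predicates and grouping can be seen in a small sketch. The table variable @sample is hypothetical and the comments assume the default ANSI_NULLS ON setting.

```sql
-- Hypothetical sample: two non-NULL values and two NULLs
DECLARE @sample TABLE (CanBeNull int NULL);
INSERT INTO @sample (CanBeNull) VALUES (1), (2), (NULL), (NULL);

-- Grouping: both NULLs land in one group, so the NULL row reports cnt = 2
SELECT CanBeNull, COUNT(*) AS cnt
FROM @sample
GROUP BY CanBeNull;

-- Predicate: NULL = NULL evaluates to UNKNOWN, so only the 1 and 2 rows
-- match themselves and matched_pairs is 2
SELECT COUNT(*) AS matched_pairs
FROM @sample a
JOIN @sample b ON a.CanBeNull = b.CanBeNull;
```

PARTITION BY follows the grouping rule, which is why the four NULL rows in the question are numbered 1 through 4 inside a single partition.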

Search within ColA duplicates against specific unique vals in ColB to exclude all of ColA

I apologize in advance; I feel like I'm missing something really simple. (And let's ignore database structure, as I'm kind of locked into that.)
Take customer orders as an example: an order number can be shipped to more than one place. For the sake of ease I'm just illustrating three, but it could be more than that (home, office, gift, gift2, gift3, etc.).
So my table is:
Customer orders:
OrderID MailingID
--------------------
1 1
1 2
1 3
2 1
3 1
3 3
4 1
4 2
4 3
What I need to find is OrderIDs that have been shipped to MailingID 1 but not 2 (basically what I need to find is orderID 2 and 3 above).
If it matters, I'm using Sql Express 2012.
Thanks
Maybe this could help:
create table #temp(
    orderID int,
    mailingID int
)
insert into #temp
select 1, 1 union all
select 1, 2 union all
select 1, 3 union all
select 2, 1 union all
select 3, 1 union all
select 3, 3 union all
select 4, 1 union all
select 4, 2 union all
select 4, 3
-- find orderIDs that have been shipped to mailingID = 1
select
    distinct orderID
from #temp
where mailingID = 1
except
-- remove orderIDs that have been shipped to mailingID = 2
select
    orderID
from #temp
where mailingID = 2
drop table #temp
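A simple subquery with the NOT IN operator should work. The outer query also has to filter on MailingID = 1; otherwise it returns every order that was not shipped to 2, whether or not it was shipped to 1.
SELECT DISTINCT a.OrderID
FROM <tablename> a
WHERE a.MailingID = 1
  AND a.OrderID NOT IN (SELECT b.OrderID
                        FROM <tablename> b
                        WHERE b.MailingID = 2)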
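Another way to express "shipped to 1 but not to 2" is NOT EXISTS. A minimal sketch follows, using a hypothetical table variable @orders loaded with the sample rows from the question. Unlike NOT IN, NOT EXISTS still returns rows if the subquery column happens to contain a NULL.

```sql
-- Hypothetical table variable holding the sample rows from the question
DECLARE @orders TABLE (OrderID int, MailingID int);
INSERT INTO @orders (OrderID, MailingID) VALUES
    (1, 1), (1, 2), (1, 3),
    (2, 1),
    (3, 1), (3, 3),
    (4, 1), (4, 2), (4, 3);

-- Orders shipped to MailingID 1 that have no row for MailingID 2
SELECT DISTINCT o.OrderID
FROM @orders o
WHERE o.MailingID = 1
  AND NOT EXISTS (SELECT 1
                  FROM @orders x
                  WHERE x.OrderID = o.OrderID
                    AND x.MailingID = 2)
ORDER BY o.OrderID;
```

For the sample data this returns OrderID 2 and 3.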

Tsql getting depending data over multiple rows in one query

I would like to calculate an average of a value in one year. I have a historical data table that saves the changes of the value in time.
I know how to do this with a (sub)query for each individual month, but I'm hopeful that there is a simple way to do it in one query.
Example:
ID, Value, DateUntilActivity
1, 10.00, 2014-03-01
2, 5.00, 2014-05-01
3, 3.00, 2014-07-01
4, 12.00, 2014-10-01
So - the correct calculation here is:
(2x10.00 + 2x5.00 + 2x3.00 + 3x12.00 + 3x<current_value_in_a_different_table>)/12
The calculation includes the number of months the data was active for - the first value, 10.00, was valid for 2 months - January and February.
And consider the value current_value_in_a_different_table a fixed value.
Also, it needs to work on MSSQL server 2005.
Thank you in advance!
;with cte as
(
    select value, DateUntilActivity from yourtable
    union
    select 100 as currentvalue, '2015-1-1' from yourothertable
)
select avg(value)
from
(
    select (select top 1 value
            from cte
            where DateUntilActivity > DATEADD(MONTH, number, '2014-1-1')
            order by DateUntilActivity) as value
    from master..spt_values
    where type = 'p' and number <= 11
) v
If my memory is wrong and you can't use a CTE, this is equivalent to
select avg(value)
from
(
    select
        (select top 1 value
         from
         (
             select value, DateUntilActivity from yourtable
             union
             select 100 as currentvalue, '2015-1-1' from yourothertable
         ) v
         where DateUntilActivity > DATEADD(MONTH, number, '2014-1-1')
         order by DateUntilActivity) as value
    from master..spt_values
    where type = 'p' and number <= 11
) v
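Both queries above lean on master..spt_values, an undocumented system table, as a source of the numbers 0 to 11. The sketch below runs the same idea on the sample data with an inline month list instead. The names @history and @current_value are hypothetical, 100.00 is only a placeholder for the value from the other table, and the syntax is kept to what SQL Server 2005 accepts (no date type, no multi-row VALUES).

```sql
-- Hypothetical table variable holding the sample history from the question
DECLARE @history TABLE (ID int, Value decimal(10,2), DateUntilActivity datetime);
INSERT INTO @history (ID, Value, DateUntilActivity)
SELECT 1, 10.00, '20140301' UNION ALL
SELECT 2,  5.00, '20140501' UNION ALL
SELECT 3,  3.00, '20140701' UNION ALL
SELECT 4, 12.00, '20141001';

-- Placeholder for the current value stored in the other table
DECLARE @current_value decimal(10,2);
SET @current_value = 100.00;

-- For each month of 2014, pick the value whose DateUntilActivity is the first
-- one after the start of that month, then average the twelve picks
SELECT AVG(m.Value) AS yearly_average
FROM (
    SELECT (SELECT TOP 1 h.Value
            FROM (SELECT Value, DateUntilActivity FROM @history
                  UNION ALL
                  SELECT @current_value, '20150101') h
            WHERE h.DateUntilActivity > DATEADD(MONTH, n.number, '20140101')
            ORDER BY h.DateUntilActivity) AS Value
    FROM (SELECT 0 AS number UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL
          SELECT 3 UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL
          SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL
          SELECT 9 UNION ALL SELECT 10 UNION ALL SELECT 11) n
) m;
```

With the placeholder of 100.00 the twelve monthly values are 2x10, 2x5, 2x3, 3x12 and 3x100, so the average works out to 372/12 = 31, matching the formula in the question.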

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the first occurrence of the date. The exception here is that I need to group them only if not interrupted with a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with - I could probably work with that. This is a query that should locate when a value is changed.
SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
d1.Value <> d2.Value
It probably could be done with good use of row_number, but I can't manage it.
Sample data:
declare @T table (ID int, Value int, Timestamp date)
insert into @T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
    select *, ROW_NUMBER() OVER (ORDER BY Timestamp) as rn --TODO - specific columns better than *
    from @T
), Firsts as (
    select
        ov1.* --TODO - specific columns better than *
    from
        OrderedValues ov1
            left join
        OrderedValues ov2
            on
                ov1.Value = ov2.Value and
                ov1.rn = ov2.rn + 1
    where
        ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second query simply finds rows where there isn't an immediate preceding row with the same Value.
Result:
ID          Value       Timestamp  rn
----------- ----------- ---------- --------------------
1           0           2001-01-01 1
3           1           2003-01-01 3
5           0           2005-01-01 5
6           2           2006-01-01 6
You can order by rn if you need the results in this specific order.
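On SQL Server 2012 or later, the same "first row of each run" result can also be sketched with LAG instead of the self-join. This is an alternative to the answer's method, not a restatement of it; it assumes Value is never NULL, and the derived-table alias x and the column PrevValue are illustrative names.

```sql
-- Same sample data as above, in a table variable
DECLARE @T TABLE (ID int, Value int, [Timestamp] date);
INSERT INTO @T (ID, Value, [Timestamp]) VALUES
    (1, 0, '20010101'), (2, 0, '20020101'),
    (3, 1, '20030101'), (4, 1, '20040101'),
    (5, 0, '20050101'),
    (6, 2, '20060101'), (7, 2, '20070101'), (8, 2, '20080101');

-- Keep a row when it is the very first one (no previous value)
-- or when its Value differs from the row immediately before it
SELECT x.ID, x.Value, x.[Timestamp]
FROM (
    SELECT ID, Value, [Timestamp],
           LAG(Value) OVER (ORDER BY [Timestamp]) AS PrevValue
    FROM @T
) x
WHERE x.PrevValue IS NULL OR x.PrevValue <> x.Value
ORDER BY x.[Timestamp];
```

For the sample data this returns IDs 1, 3, 5 and 6, the same rows as the result above.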
