Delete duplicates in MS-SQL with minimum date and multiple keys

I have a big table that has duplicates as such:
number (primary key),
group (primary key),
Division (primary key),
dateChange.
Example:
1,2,3,20121015
1,2,3,20120101
1,2,3,20110101
2,2,2,20121010
2,2,2,20120101
result should be:
1,2,3,20121015
2,2,2,20121010
I have tried many combinations, including grouping by the primary key with the minimum "dateChange",
but nothing seems to work.
I want to have something like this:
delete from table where (number, [group], Division, dateChange) not in
(select number, [group], Division, MAX(dateChange) from table
group by number, [group], Division)
But I don't think that is valid MS-SQL syntax.
Your help will be much appreciated!

To delete all rows except the latest one for each number, group, Division combination:
;WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY number, [group], Division
                              ORDER BY dateChange DESC) AS RN
    FROM YourTable)
DELETE FROM cte
WHERE RN > 1

The following should work.
delete t from
YourTable t inner join (select
    number, [group], Division, dateChange, row_number() over
    (partition by number, [group], Division order by dateChange desc) as ranker
    from YourTable) Z
on t.number = Z.number and t.[group] = Z.[group] and
   t.Division = Z.Division and t.dateChange = Z.dateChange
where Z.ranker <> 1
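Either way, it is safer to preview what a DELETE will remove before running it. A minimal sketch of such a check, assuming the table name YourTable from the first answer:
;WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY number, [group], Division
                                 ORDER BY dateChange DESC) AS RN
    FROM YourTable
)
SELECT *      -- these are the rows the DELETE statements above would remove
FROM cte
WHERE RN > 1;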

How does order by work when all column values are identical?

I use SQL Server 2016. Below are the rows in the table test_account. You can see the values of updDtm and fileCreatedTime are identical; id is the primary key.
id accno updDtm fileCreatedTime
-----------------------------------------------------------------------
1 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
2 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
3 123456789 2022-07-27 09:41:10.0000000 2022-07-27 11:33:33.8300000
I want to query the latest account id whose accno is 123456789, ordered by updDtm, fileCreatedTime.
I run the following SQL; the output result is id = 1.
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1
My question is: is the query result repeatable and reliable (always outputting id = 1 whether run once or many times) when the values of the columns updDtm and fileCreatedTime are identical, or does it output an arbitrary id?
I read some articles and learned that for MySQL and Oracle the query result is not reliable or reproducible. How about SQL Server?
The context of this documentation reference is ORDER BY usage with OFFSET and FETCH, but the same considerations apply to all ORDER BY usage, including windowing functions like ROW_NUMBER(). In summary,
To achieve stable results between query requests, the following conditions must be met:
The underlying data that is used by the query must not change.
The ORDER BY clause contains a column or combination of columns that are guaranteed to be unique.
I'm trying to find a case to test whether the query would output a result other than id = 1, but with no luck.
The ordering of rows when duplicate ORDER BY values exist is undefined (a.k.a. non-deterministic and arbitrary) because it depends on the execution plan (which may vary due to available indexes, stats, and the optimizer), parallelism, database engine internals, and even physical data storage. The example below yields different results due to a parallel plan on my test instance.
DROP TABLE IF EXISTS dbo.test_account;
CREATE TABLE dbo.test_account(
id int NOT NULL
CONSTRAINT pk_test_account PRIMARY KEY CLUSTERED
, accno int NOT NULL
, updDtm datetime2 NOT NULL
, fileCreatedTime datetime2 NOT NULL
);
--insert 100K rows
WITH
t10 AS (SELECT n FROM (VALUES(0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) t(n))
,t1k AS (SELECT 0 AS n FROM t10 AS a CROSS JOIN t10 AS b CROSS JOIN t10 AS c)
,t1g AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) AS num FROM t1k AS a CROSS JOIN t1k AS b CROSS JOIN t1k AS c)
INSERT INTO dbo.test_account (id, accno, updDtm, fileCreatedTime)
SELECT num, 123456789, '2022-07-27 09:41:10.0000000', '2022-07-27 11:33:33.8300000'
FROM t1g
WHERE num <= 100000;
GO
--run query 10 times
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
Example results:
1
27001
25945
57071
62813
1
1
1
36450
78805
The simple solution is to add the primary key as the last column to the ORDER BY clause to break ties. This returns the same id value (1) in every iteration regardless of the execution plan and indexes.
SELECT t.id
FROM
(SELECT
ROW_NUMBER() OVER(PARTITION BY a.accno ORDER BY a.updDtm desc, a.fileCreatedTime DESC, a.id) AS seq,
a.id, a.accno, a.updDtm, a.fileCreatedTime
FROM
test_account a) AS t
WHERE t.seq = 1;
GO 10
On a side note, this index will optimize the query:
CREATE NONCLUSTERED INDEX idx ON dbo.test_account(accno, updDtm DESC, fileCreatedTime DESC, id);

Update multiple records with duplicate column value

I have a query that identifies how many times a ChassisNo was used:
Query:
SELECT
ROW_NUMBER() OVER (
PARTITION BY ChassisNo
ORDER BY datecreated ASC
) row_num,
CollateralType,
LoanID,
ClientID,
CollateralID,
PlateNo,
ChassisNo,
EngineNo,
datecreated,
PreparedBy
FROM
TestAllLoanWithCollaterals
Result:
I highlighted an example of a chassisno duplicated three times; some chassisno values are duplicated 5 times or so. The main thing is: how can I update all records sharing a chassisno with the details of its latest record?
Expected result (based on the highlighted example above):
The yellow highlight is the latest record based on the datecreated column and is always the last row_num of each chassisno; the blue highlight marks the columns that should be updated.
I am thinking of using a database cursor, but I don't think it is possible.
You may use an update join involving your original table and the logic you have already defined:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ChassisNo ORDER BY datecreated DESC) rn
    FROM TestAllLoanWithCollaterals
)
UPDATE a
SET
    CollateralType = b.CollateralType,
    LoanID = b.LoanID,
    ClientID = b.ClientID,
    CollateralID = b.CollateralID,
    PlateNo = b.PlateNo,
    EngineNo = b.EngineNo,
    datecreated = b.datecreated,
    PreparedBy = b.PreparedBy
FROM TestAllLoanWithCollaterals a
INNER JOIN cte b
    ON a.ChassisNo = b.ChassisNo
WHERE
    b.rn = 1;
Note that the above update simply overwrites all fields among the duplicates for a chassis with those of the most recently created record in the group.
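If you want to sanity-check the change first, here is a small preview sketch (the selected columns are illustrative) listing the rows that differ from their group's latest record:
WITH cte AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY ChassisNo ORDER BY datecreated DESC) rn
    FROM TestAllLoanWithCollaterals
)
SELECT a.LoanID, a.ChassisNo, a.datecreated,   -- rows that would be overwritten
       b.LoanID AS latestLoanID, b.datecreated AS latestDatecreated
FROM TestAllLoanWithCollaterals a
INNER JOIN cte b ON a.ChassisNo = b.ChassisNo
WHERE b.rn = 1 AND a.datecreated <> b.datecreated;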

T-SQL - Get last as-at date SUM(Quantity) was not negative

I am trying to find a way to get the last date, by location and product, on which a sum was positive. The only way I can think to do it is with a cursor, and if that's the case I may as well just do it in code. Before I go down that route, I was hoping someone may have a better idea?
Table:
Product, Date, Location, Quantity
The scenario is: I find the quantity by location and product at a particular date; if it is negative, I need to get the sum and date when the group was last positive.
select
Product,
Location,
SUM(Quantity) Qty,
SUM(Value) Value
from
ProductTransactions PT
where
Date <= @AsAtDate
group by
Product,
Location
I am looking for the last date where the sum of the transactions up to and including it is positive.
Based on your revised question and your comment, here is another solution that I hope answers your question.
select Product, Location, max(Date) as Date
from (
select a.Product, a.Location, a.Date from ProductTransactions as a
join ProductTransactions as b
on a.Product = b.Product and a.Location = b.Location
where b.Date <= a.Date
group by a.Product, a.Location, a.Date
having sum(b.Value) >= 0
) as T
group by Product, Location
The subquery (table T) produces a list of {product, location, date} rows for which the sum of the values up to and including that date is non-negative. From that set, we select the last date for each {product, location} pair.
This can be done in a set-based way using windowed aggregates to construct the running total. Depending on the number of rows in the table this could be a bit slow, but you can't really limit the time range going backwards, as the last positive date is an unknown quantity.
I've used a CTE for convenience to construct the aggregated data set, but converting that to a temp table should be faster, since CTEs are re-evaluated each time they are referenced whereas a temp table is populated once; a sketch of that conversion follows the query below.
The basic theory is to construct the running totals for all of the previous days using the OVER clause to partition and order the SUM aggregates. This data set is then used and filtered to the expected date. When a row in that table has a quantity less than zero it is joined back to the aggregate data set for all previous days for that product and location where the quantity was greater than zero.
Since this may return multiple positive date rows the ROW_NUMBER() function is used to order the rows based on the date of the positive quantity day. This is done in descending order so that row number 1 is the most recent positive day. It isn't possible to use a simple MIN() here because the MIN([Date]) may not correspond to the MIN(Quantity).
WITH x AS (
SELECT [Date],
Product,
[Location],
SUM(Quantity) OVER (PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS Quantity,
SUM([Value]) OVER(PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS [Value]
FROM ProductTransactions
WHERE [Date] <= @AsAtDate
)
SELECT [Date], Product, [Location], Quantity, [Value], Positive_date, Positive_date_quantity
FROM (
SELECT x1.[Date], x1.Product, x1.[Location], x1.Quantity, x1.[Value],
x2.[Date] AS Positive_date, x2.[Quantity] AS Positive_date_quantity,
ROW_NUMBER() OVER (PARTITION BY x1.Product, x1.[Location] ORDER BY x2.[Date] DESC) AS Positive_date_row
FROM x AS x1
LEFT JOIN x AS x2 ON x1.Product=x2.Product AND x1.[Location]=x2.[Location]
AND x2.[Date]<x1.[Date] AND x1.Quantity<0 AND x2.Quantity>0
WHERE x1.[Date] = @AsAtDate
) AS y
WHERE Positive_date_row=1
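For reference, a sketch of the temp-table conversion mentioned above: the same running totals are materialized once into #x, and the self-join then reads the temp table instead of re-evaluating the CTE.
-- Materialize the running totals once.
SELECT [Date], Product, [Location],
       SUM(Quantity) OVER (PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS Quantity,
       SUM([Value]) OVER (PARTITION BY Product, [Location] ORDER BY [Date] ASC) AS [Value]
INTO #x
FROM ProductTransactions
WHERE [Date] <= @AsAtDate;

-- Same query as above, with #x in place of the CTE x.
SELECT [Date], Product, [Location], Quantity, [Value], Positive_date, Positive_date_quantity
FROM (
    SELECT x1.[Date], x1.Product, x1.[Location], x1.Quantity, x1.[Value],
           x2.[Date] AS Positive_date, x2.Quantity AS Positive_date_quantity,
           ROW_NUMBER() OVER (PARTITION BY x1.Product, x1.[Location] ORDER BY x2.[Date] DESC) AS Positive_date_row
    FROM #x AS x1
    LEFT JOIN #x AS x2 ON x1.Product=x2.Product AND x1.[Location]=x2.[Location]
        AND x2.[Date]<x1.[Date] AND x1.Quantity<0 AND x2.Quantity>0
    WHERE x1.[Date] = @AsAtDate
) AS y
WHERE Positive_date_row=1;

DROP TABLE #x;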
Do you mean that you want to get the last date on which the running quantity in the group became positive?
For example, if you are using SQL Server 2012+:
In the following scenario, when the date reaches 01/03/2017 the running sum of quantity comes to 1 (-10+5+6).
Is it possible for the quantity to become negative again on a later date?
;WITH tb(Product, Location,[Date],Quantity) AS(
SELECT 'A','B',CONVERT(DATETIME,'01/01/2017'),-10 UNION ALL
SELECT 'A','B','01/02/2017',5 UNION ALL
SELECT 'A','B','01/03/2017',6 UNION ALL
SELECT 'A','B','01/04/2017',2
)
SELECT t.Product, t.Location, SUM(t.Quantity) AS Qty,
       MIN(CASE WHEN t.CurrentSum > 0 THEN t.Date ELSE NULL END) AS LastPositiveDate
FROM (
    SELECT *, SUM(tb.Quantity) OVER (ORDER BY [Date]) AS CurrentSum FROM tb
) AS t
GROUP BY t.Product, t.Location
Product Location Qty LastPositiveDate
------- -------- ----------- -----------------------
A B 3 2017-01-03 00:00:00.000

GROUP BY doesn't contain specific column

I have the following statement in MSSQL
SELECT a, b, MAX(t)
FROM table
GROUP BY a, b
What I want is to also show the c and d columns for each row in the result. How can I do that?
It sounds like you're looking for ROW_NUMBER() or RANK() (the former will ignore ties, the latter will include them), something like:
;With Ranked as (
SELECT a,b,c,d,t,
ROW_NUMBER() OVER (PARTITION BY a,b
ORDER BY t desc) as rn
FROM table
)
SELECT * from Ranked where rn = 1
Which will return one row for each unique combination of the a,b columns, choosing the other values such that they come from the row with the highest t value (and, as I say, this variant ignores ties).
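To see the tie-handling difference concretely, here is a small self-contained sketch with made-up values, where two rows tie on t:
;WITH sample AS (
    SELECT * FROM (VALUES
        (1, 1, 'c1', 'd1', 10),
        (1, 1, 'c2', 'd2', 10),   -- tied on t with the row above
        (1, 1, 'c3', 'd3', 5)
    ) v(a, b, c, d, t)
)
SELECT a, b, c, d, t,
       ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY t DESC) AS rn,
       RANK()       OVER (PARTITION BY a, b ORDER BY t DESC) AS rk
FROM sample;
-- rn = 1 for exactly one of the tied rows (chosen arbitrarily); rk = 1 for both.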

SQL Select set of records from one table, join each record to top 1 record of second table matching 1 column, sorted by a column in the second table

This is my first question on here, so I apologize if I break any rules.
Here's the situation. I have a table that lists all the employees and the building to which they are assigned, plus training hours, with ssn as the id column. I have another table that lists all the employees in the company, also keyed by ssn, but including name and other personal data. The second table contains multiple records for each employee, captured at different points in time. What I need to do is select all the records in the first table for a certain building, then get the most recent name from the second table, and allow the result set to be sorted by any of the returned columns.
I have this in place and it works fine; it is just very slow.
A very simplified version of the tables are:
table1 (ssn CHAR(9), buildingNumber CHAR(7), trainingHours DEC(5,2)) (7,200 rows)
table2 (ssn CHAR(9), fName VARCHAR(20), lName VARCHAR(20), sequence INT) (708,000 rows)
The sequence column in table2 is a number corresponding to a predetermined entry date; the higher the number, the more recent the entry. It is common/expected that each employee has several records, but several may not have the most recent one (i.e. '8').
My SProc is:
@BuildingNumber CHAR(7), @SortField VARCHAR(25)
BEGIN
DECLARE @returnValue TABLE(ssn CHAR(9), buildingNumber CHAR(7), fName VARCHAR(20), lName VARCHAR(20), rowNumber INT)
INSERT INTO @returnValue(...)
SELECT ssn, buildingNumber, fName, lName, RowNumber
FROM (SELECT ..., CASE @SortField ... ROW_NUMBER() OVER (PARTITION BY buildingNumber ORDER BY {sortField column}) END AS RowNumber
FROM table1 a
OUTER APPLY (SELECT TOP 1 fName, lName FROM table2 WHERE ssn = a.ssn ORDER BY sequence DESC) AS e
WHERE buildingNumber = @BuildingNumber) AS x
SELECT * FROM @returnValue ORDER BY RowNumber
END
I have indexes for the following:
table1: buildingNumber (non-unique, nonclustered)
table2: sequence_ssn (unique, nonclustered)
Like I said this gets me the correct result set, but it is rather slow. Is there a better way to go about doing this?
It's not possible to change the database structure or the way table2 operates. Trust me, if it were, it would be done. Are there any indexes I could add that would help speed this up?
I've looked at the execution plans: there is a clustered index scan on table2 (18%), then a compute scalar (0%), then an eager spool (59%), then a filter (0%), then a top N sort (14%).
That's 78% of the execution, so I know it's in the section that gets the names; I'm just not sure of a better (faster) way to do it.
The reason I'm asking is that table 1 needs to be updated with current data. This is done through a webpage with a radgrid control. It has a range, start index, all that, and it takes forever for the users to update their data.
I can change how the update process is done, but I thought I'd ask about the query first.
Thanks in advance.
I would approach this with window functions. The idea is to assign a sequence number to records in the table with duplicates (I think table2), such that the most recent record has a value of 1. Then just select this as the most recent record:
select t1.*, t2.*
from table1 t1 join
(select t2.*,
row_number() over (partition by ssn order by sequence desc) as seqnum
from table2 t2
) t2
on t1.ssn = t2.ssn and t2.seqnum = 1
where t1.buildingNumber = #BuildingNumber;
My second suggestion is to use a user-defined function rather than a stored procedure:
create function XXX (
@BuildingNumber int
)
returns table as
return (
select t1.ssn, t1.buildingNumber, t2.fname, t2.lname, t2.seqnum
from table1 t1 join
(select t2.*,
row_number() over (partition by ssn order by sequence desc) as seqnum
from table2 t2
) t2
on t1.ssn = t2.ssn and t2.seqnum = 1
where t1.buildingNumber = @BuildingNumber;
);
(This doesn't have the logic for the ordering because that doesn't seem to be the central focus of the question.)
You can then call it as:
select *
from dbo.XXX(<building number>);
EDIT:
The following may speed it up further, because you are only selecting a small(ish) subset of the employees:
select *
from (select t1.ssn, t1.buildingNumber, t1.trainingHours, t2.fName, t2.lName,
             row_number() over (partition by t2.ssn order by t2.sequence desc) as seqnum
      from table1 t1 join
           table2 t2
           on t1.ssn = t2.ssn
      where t1.buildingNumber = @BuildingNumber
     ) t
where seqnum = 1;
And, finally, I suspect that the following might be the fastest:
select t1.*, t2.*
from table1 t1 join
     table2 t2
     on t1.ssn = t2.ssn
where t1.buildingNumber = @BuildingNumber and
      t2.sequence = (select max(sequence) from table2 t2a where t2a.ssn = t1.ssn)
In all these cases, an index on table2(ssn, sequence) should help performance.
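A sketch of that index (the name is illustrative; the INCLUDE columns make it covering for the name lookup in the simplified schema):
CREATE NONCLUSTERED INDEX ix_table2_ssn_sequence
    ON table2 (ssn, sequence DESC)
    INCLUDE (fName, lName);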
Try using temp tables instead of the table variables. Not sure what kind of system you are working on, but I have had pretty good luck with them. Temp tables actually write to the drive, so you won't be holding and processing so much in memory. Depending on other system usage this might do the trick.
Simply define the temp table using #Tablename instead of @Tablename. Put the name-sorting subquery in a temp table before everything else fires off and join to it, as in the sketch below.
Just make sure to drop the table at the end. It will be dropped automatically when the stored procedure's session disconnects, but it is a good idea to drop it explicitly, to be on the safe side.
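A minimal sketch of that approach, assuming the simplified table1/table2 schema from the question:
-- Materialize the latest name per ssn once...
SELECT ssn, fName, lName
INTO #LatestNames
FROM (
    SELECT ssn, fName, lName,
           ROW_NUMBER() OVER (PARTITION BY ssn ORDER BY sequence DESC) AS rn
    FROM table2
) s
WHERE rn = 1;

-- ...then join to it instead of sorting table2 per employee row.
SELECT a.ssn, a.buildingNumber, n.fName, n.lName
FROM table1 a
INNER JOIN #LatestNames n ON n.ssn = a.ssn
WHERE a.buildingNumber = @BuildingNumber;

DROP TABLE #LatestNames;   -- explicit cleanup, per the note above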
