TSQL - extract data to table/view to speed up query - sql-server

I use this statement to create a list for excel
SELECT DISTINCT Year, Version
FROM myView
WHERE id <> 'old'
ORDER BY Year DESC, Version DESC
The problem is that the execution time is over 30s because of the almost 2 million rows.
The result has only around 1000 rows.
What are my options to extract only those two columns in order to speed up the execution time? I also need to make sure that inserts to the underlying table are recognized.
Do I need a new table to copy the values from the view? And a trigger to manage the updates?
Thank you

So, presumably there's a table with Year, Version and id underlying your view. Given this (trivial) example:
CREATE TABLE myTable ([id] varchar(10), [Year] int, [Version] int);
Just create an index on that table that matches the way you're querying your data. Given your query of:
SELECT DISTINCT Year, Version
FROM myView
WHERE id <> 'old'
ORDER BY Year DESC, Version DESC
This index matches the WHERE and ORDER BY clauses and should give you all the performance you need:
IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[myTable]') AND name = N'IX_YearVersion_Filtered')
DROP INDEX [IX_YearVersion_Filtered] ON [dbo].[myTable] WITH ( ONLINE = OFF )
GO
CREATE NONCLUSTERED INDEX [IX_YearVersion_Filtered] ON [dbo].[myTable]
(
[Year] DESC,
[Version] DESC
)
WHERE ([id]<>'old')
GO
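A quick way to check whether the filtered index helps is to run the query against the base table with I/O statistics switched on. This is only a sketch: it assumes the view is a thin wrapper over dbo.myTable (the example table above), and the view's definition may prevent the optimizer from using the index. Indexes are maintained automatically, so new inserts into the table show up without a trigger or a copy table.

```sql
-- Sketch: compare logical reads with and without the filtered index.
-- dbo.myTable is the hypothetical base table from the example above.
SET STATISTICS IO ON;

SELECT DISTINCT [Year], [Version]
FROM dbo.myTable
WHERE [id] <> 'old'          -- same predicate as the index filter
ORDER BY [Year] DESC, [Version] DESC;

SET STATISTICS IO OFF;
```

If the Messages tab still reports a full scan of the table, the predicate in the view probably does not match the index filter exactly.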

with cte_x
as
(SELECT Year, Version
FROM myView
WHERE id not in ('old')
group by Year, Version)
SELECT DISTINCT Year, Version
FROM cte_x
ORDER BY Year DESC, Version DESC

Related

Efficient limit result set in SQL window function

My question would be better served as a comment on Limit result set in sql window function, but I don't have the necessary reputation to comment.
Given a table of moving vehicle locations, for each vehicle I wish to find the most recent recorded position (and other data about the vehicle at that time). Based on answers in the other question, I can run a query like:
Table definition:
CREATE TABLE VehiclePositions
(
Id BIGINT NOT NULL,
VehicleID NVARCHAR(12) NULL,
Timestamp DATETIME NULL,
PositionX FLOAT NULL,
PositionY FLOAT NULL,
PositionZ SMALLINT NULL,
Speed SMALLINT NULL,
Heading SMALLINT NULL
)
Query:
select *
from
(select
*,
row_number() over (partition by VehicleID order by Timestamp desc) as ranking
from VehiclePositions) as x
where
ranking = 1
Now, the problem is that this does a full table scan. I thought that by creating an appropriate index, I could avoid this:
CREATE INDEX idx_VehicPosition ON VehiclePositions(VehicleID, Timestamp);
However, SQL Server will happily ignore this index in the query and still perform the table scan.
Note: I can get SQL Server to use the index, but the code is rather ugly:
DECLARE @ids TABLE (id NVARCHAR(12) UNIQUE)
INSERT INTO @ids
SELECT DISTINCT VehicleID
FROM VehiclePositions
SELECT vp.*
FROM VehiclePositions vp
WHERE Timestamp = (SELECT Max(TimeStamp) FROM VehiclePositions vp2
WHERE vp2.VehicleID = vp.VehicleID)
AND VehicleID IN (SELECT DISTINCT id FROM @ids)
(The VehicleID IN... is because it seems SQL Server doesn't implement seek-skip optimisations. It still comes up with a pretty non-optimal query plan that visits the index twice, but at least it doesn't execute in linear time).
Is there a way to make SQL Server run the window function query intelligently?
I'm using SQL Server 2014...
Help will be appreciated
What I would do:
SELECT *
FROM
(SELECT MAX(Timestamp) as maxtime,
VehicleID
FROM VehiclePositions
GROUP BY VehicleID ) as maxed INNER JOIN
(SELECT Id ,
VehicleID ,
Timestamp ,
PositionX ,
PositionY,
PositionZ,
Speed ,
Heading
FROM VehiclePositions) as vals
ON maxed.maxtime = vals.Timestamp
AND maxed.VehicleID = vals.VehicleID
To my knowledge you can't get around your index getting scanned twice.
As long as you are selecting all vehicles from the table and are selecting all columns (or at least columns that are not in your index), I would expect the table scan to keep popping up.
In many cases, that will actually be the most efficient query plan. Only if you have many rows per vehicle (like several pages) might a seek strategy be faster.
If you do have a lot of rows per vehicle, you might consider partitioning your table on Timestamp...
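For reference, a seek strategy of the kind mentioned above could be sketched with CROSS APPLY and TOP (1). This is an illustration rather than a guaranteed improvement: it assumes the idx_VehicPosition index from the question exists, it skips rows whose VehicleID is NULL, and the DISTINCT still reads the index once to build the vehicle list. Whether it beats the scan depends on how many rows each vehicle has.

```sql
-- Sketch: one index seek per vehicle instead of ranking every row.
-- Assumes: CREATE INDEX idx_VehicPosition ON VehiclePositions(VehicleID, Timestamp);
SELECT latest.*
FROM (SELECT DISTINCT VehicleID
      FROM VehiclePositions
      WHERE VehicleID IS NOT NULL) AS v
CROSS APPLY (SELECT TOP (1) vp.*
             FROM VehiclePositions AS vp
             WHERE vp.VehicleID = v.VehicleID
             ORDER BY vp.[Timestamp] DESC) AS latest;
```

If a separate table of vehicles already exists, it could replace the DISTINCT subquery and avoid that extra pass over the index.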
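In databases that support the QUALIFY clause (Teradata and Snowflake, for example; SQL Server does not), you can filter the result of a window function directly, as follows: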
select *
from VehiclePositions
qualify row_number() over (partition by VehicleID order by Timestamp desc) = 1

Reverse of each value in a column

Suppose I have a table with an even number of rows, for example a table Employee with two columns, Name and EmpCode. The table looks like
Name EmpCode
Ajay 7
Vikash 5
Shalu 4
Hari 8
Anu 1
Puja 9
Now, I want my output in reverse of EmpCode like:
Name EmpCode
Ajay 9
Vikash 1
Shalu 8
Hari 4
Anu 5
Puja 7
I need to run this query in SQL Server.
As the OP hasn't replied, I'll post a little explanation for them instead. As everyone has alluded to, tables in SQL Server have no built-in ordering. Your data is stored in what is known as a HEAP. This means, when you run a query without an ORDER BY your data can return in any order that the Server feels like. With small datasets this might be in the order you inserted it in, but that's just it (it might).
When you get to larger datasets, and when you have multiple cores running on the operation, then the order of a SELECT * FROM [Table]; is more likely to not be the order of insertion, and is more likely to be random with each instance of running the query. I have several tables where a SELECT TOP 1 *... will return a different row every time I run the query; even with the CLUSTERED INDEX.
The only, yes only, way to guarantee the order is by using ORDER BY. Now, you might have another column which you haven't shared that you can order by, but if not, perhaps this (very) simple example will at least assist you, if nothing else:
CREATE TABLE #Employee ([Name] varchar(10), EmpCode tinyint);
INSERT INTO #Employee
VALUES ('Ajay',7),
('Vikash',5),
('Shalu',4),
('Hari',8),
('Anu',1),
('Puja',9);
GO
--Just SELECT *. ORDER is NOT guaranteed, but, due to the low volume of data, will probably be in the order by insertion
SELECT *
FROM #Employee;
--But, we want to reverse the order, so, let's add an ORDER BY
SELECT *
FROM #Employee
ORDER BY [Name];
--Oh! That didn't work (duh). Let's try again
SELECT *
FROM #Employee
ORDER BY Empcode;
--Nope, this isn't working. That's because your data has nothing related to its insertion order. So, let's give it one:
GO
DROP TABLE #Employee;
CREATE TABLE #Employee (ID int IDENTITY(1,1), --Oooo, what is this?
[Name] varchar(10),
EmpCode tinyint);
INSERT INTO #Employee
VALUES ('Ajay',7),
('Vikash',5),
('Shalu',4),
('Hari',8),
('Anu',1),
('Puja',9);
GO
--Now look
SELECT *
FROM #Employee;
--So, we can use an ORDER BY, and get the correct order too
SELECT [Name],
Empcode
FROM #Employee
ORDER BY ID;
--So, we got the right ORDER using an ORDER BY. Now we can do something about the ordering:
--We'll need a CTE for this:
WITH RNs AS(
SELECT *,
ROW_NUMBER() OVER (ORDER BY ID ASC) AS RN1,
ROW_NUMBER() OVER (ORDER BY ID DESC) AS RN2
FROM #Employee)
SELECT R1.[Name],
R2.EmpCode
FROM RNs R1
JOIN RNs R2 ON R1.RN1 = R2.RN2;
GO
DROP TABLE #Employee;

SQL Server: Find duplicates using group by and having count than delete them all but not first [duplicate]

I need to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows).
The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.
MyTable
RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
How can I do this?
Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE MyTable
FROM MyTable
LEFT OUTER JOIN (
SELECT MIN(RowId) as RowId, Col1, Col2, Col3
FROM MyTable
GROUP BY Col1, Col2, Col3
) as KeepRows ON
MyTable.RowId = KeepRows.RowId
WHERE
KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
MIN(RowId)
with
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
;
--Ensure that any immediately preceding statement is terminated with a semicolon above
WITH cte
AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
ORDER BY ( SELECT 0)) RN
FROM #MyTable)
DELETE FROM cte
WHERE RN > 1;
I am using ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in RowID order for example you could use ORDER BY RowID DESC
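The ORDER BY RowID DESC variant mentioned above could look like the sketch below. It assumes the MyTable definition from the question. Replacing the DELETE with SELECT * FROM cte WHERE RN > 1 previews the rows that would be removed.

```sql
-- Sketch: keep the row with the highest RowID in each duplicate group.
-- Terminate any preceding statement with a semicolon before the WITH.
WITH cte AS (
    SELECT RowID,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3
                              ORDER BY RowID DESC) AS RN
    FROM MyTable
)
DELETE FROM cte
WHERE RN > 1;
```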
Execution Plans
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
The ROW_NUMBER solution will always give pretty much the same plan whereas the GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
No useful index on the partitioning columns
relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
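The copy-out, TRUNCATE and copy-back idea could be sketched as follows. It assumes the MyTable definition from the question, that no foreign keys reference the table (TRUNCATE would fail otherwise), and that nothing else writes to the table in the meantime. #KeepRows is a hypothetical temp table name.

```sql
-- Sketch: keep one row per duplicate group via a temp table.
BEGIN TRANSACTION;

-- One surviving row (lowest RowID) per (Col1, Col2, Col3) group.
SELECT MIN(RowID) AS RowID, Col1, Col2, Col3
INTO #KeepRows
FROM dbo.MyTable
GROUP BY Col1, Col2, Col3;

TRUNCATE TABLE dbo.MyTable;

-- RowID is an identity column, so explicit values need IDENTITY_INSERT.
SET IDENTITY_INSERT dbo.MyTable ON;

INSERT INTO dbo.MyTable (RowID, Col1, Col2, Col3)
SELECT RowID, Col1, Col2, Col3
FROM #KeepRows;

SET IDENTITY_INSERT dbo.MyTable OFF;

COMMIT TRANSACTION;

DROP TABLE #KeepRows;
```

Running it once with ROLLBACK in place of COMMIT is a cheap way to check the row counts first.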
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField
AND dupes.secondDupField = fullTable.secondDupField
AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.
DELETE FROM TableName
WHERE ID NOT IN (SELECT MAX(ID)
FROM TableName
GROUP BY Column1,
Column2,
Column3
/*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
nullable. Because of semantics of NOT IN (NULL) including the clause
below can simplify the plan*/
HAVING MAX(ID) IS NOT NULL)
The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName,
COUNT(*) TotalCount
FROM YourTableName
GROUP BY YourColumnName
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid
Postgres:
delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
DELETE LU
FROM (SELECT *,
Row_number()
OVER (
partition BY col1, col2, col3
ORDER BY rowid DESC) [Row]
FROM mytable) LU
WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE
FROM
Mytable
WHERE
RowID NOT IN (
SELECT
MIN(RowID)
FROM
Mytable
GROUP BY
Col1,
Col2,
Col3
)
Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)
I would prefer a CTE for deleting duplicate rows from a SQL Server table.
I strongly recommend following this article: http://codaffection.com/sql-server-article/delete-duplicate-rows-in-sql-server/
Keeping the original:
WITH CTE AS
(
SELECT *,ROW_NUMBER() OVER (PARTITION BY col1,col2,col3 ORDER BY col1,col2,col3) AS RN
FROM MyTable
)
DELETE FROM CTE WHERE RN<>1
Without keeping the original:
WITH CTE AS
(SELECT *,R=RANK() OVER (ORDER BY col1,col2,col3)
FROM MyTable)
 
DELETE CTE
WHERE R IN (SELECT R FROM CTE GROUP BY R HAVING COUNT(*)>1)
To Fetch Duplicate Rows:
SELECT
name, email, COUNT(*)
FROM
users
GROUP BY
name, email
HAVING COUNT(*) > 1
To Delete the Duplicate Rows:
DELETE users
WHERE rowid NOT IN
(SELECT MIN(rowid)
FROM users
GROUP BY name, email);
Quick and Dirty to delete exact duplicated rows (for small tables):
select distinct * into t2 from t1;
delete from t1;
insert into t1 select * from t2;
drop table t2;
I prefer the subquery\having count(*) > 1 solution to the inner join because I found it easier to read and it was very easy to turn into a SELECT statement to verify what would be deleted before you run it.
--DELETE FROM table1
--WHERE id IN (
SELECT MIN(id) FROM table1
GROUP BY col1, col2, col3
-- could add a WHERE clause here to further filter
HAVING count(*) > 1
--)
SELECT DISTINCT *
INTO tempdb.dbo.tmpTable
FROM myTable
TRUNCATE TABLE myTable
INSERT INTO myTable SELECT * FROM tempdb.dbo.tmpTable
DROP TABLE tempdb.dbo.tmpTable
I thought I'd share my solution since it works under special circumstances.
In my case the table with duplicate values did not have a foreign key (because the values were duplicated from another db).
begin transaction
-- create temp table with identical structure as source table
Select * Into #temp From tableName Where 1 = 2
-- insert distinct values into temp
insert into #temp
select distinct *
from tableName
-- delete from source
delete from tableName
-- insert into source from temp
insert into tableName
select *
from #temp
rollback transaction
-- if this works, change rollback to commit and execute again to keep your changes!!
PS: when working on things like this I always use a transaction. This not only ensures everything is executed as a whole, but also allows me to test without risking anything. But of course you should take a backup anyway just to be sure...
This query showed very good performance for me:
DELETE tbl
FROM
MyTable tbl
WHERE
EXISTS (
SELECT
*
FROM
MyTable tbl2
WHERE
tbl2.SameValue = tbl.SameValue
AND tbl.IdUniqueValue < tbl2.IdUniqueValue
)
It deleted 1M rows in a little more than 30 seconds from a table of 2M (50% duplicates).
Using CTE. The idea is to join on one or more columns that form a duplicate record and then remove whichever you like:
;with cte as (
select
min(PrimaryKey) as PrimaryKey,
UniqueColumn1,
UniqueColumn2
from dbo.DuplicatesTable
group by
UniqueColumn1, UniqueColumn2
having count(*) > 1
)
delete d
from dbo.DuplicatesTable d
inner join cte on
d.PrimaryKey > cte.PrimaryKey and
d.UniqueColumn1 = cte.UniqueColumn1 and
d.UniqueColumn2 = cte.UniqueColumn2;
Yet another easy solution can be found at the link pasted here. This one is easy to grasp and seems to be effective for most similar problems. It is for SQL Server, though, and the concept used is more than acceptable.
Here are the relevant portions from the linked page:
Consider this data:
EMPLOYEE_ID ATTENDANCE_DATE
A001 2011-01-01
A001 2011-01-01
A002 2011-01-01
A002 2011-01-01
A002 2011-01-01
A003 2011-01-01
So how can we delete those duplicate data?
First, insert an identity column in that table by using the following code:
ALTER TABLE dbo.ATTENDANCE ADD AUTOID INT IDENTITY(1,1)
Use the following code to resolve it:
DELETE FROM dbo.ATTENDANCE WHERE AUTOID NOT IN (SELECT MIN(AUTOID)
FROM dbo.ATTENDANCE GROUP BY EMPLOYEE_ID,ATTENDANCE_DATE)
This is the easiest way to delete duplicate record
DELETE FROM tblemp WHERE id IN
(
SELECT MIN(id) FROM tblemp
GROUP BY title HAVING COUNT(id)>1
)
Use this
WITH tblTemp as
(
SELECT ROW_NUMBER() Over(PARTITION BY Name,Department ORDER BY Name)
As RowNumber,* FROM <table_name>
)
DELETE FROM tblTemp where RowNumber >1
Here is another good article on removing duplicates.
It discusses why it's hard: "SQL is based on relational algebra, and duplicates cannot occur in relational algebra, because duplicates are not allowed in a set."
It also covers the temp table solution and two MySQL examples.
In the future, are you going to prevent it at the database level, or from an application perspective? I would suggest the database level, because your database should be responsible for maintaining referential integrity; developers will just cause problems ;)
I had a table where I needed to preserve non-duplicate rows.
I'm not sure on the speed or efficiency.
DELETE FROM myTable WHERE RowID IN (
SELECT MIN(RowID) AS IDNo FROM myTable
GROUP BY Col1, Col2, Col3
HAVING COUNT(*) = 2 )
Oh sure. Use a temp table. If you want a single, not-very-performant statement that "works" you can go with:
DELETE FROM MyTable WHERE NOT RowID IN
(SELECT
(SELECT TOP 1 RowID FROM MyTable mt2
WHERE mt2.Col1 = mt.Col1
AND mt2.Col2 = mt.Col2
AND mt2.Col3 = mt.Col3)
FROM MyTable mt)
Basically, for each row in the table, the sub-select finds the top RowID of all rows that are exactly like the row under consideration. So you end up with a list of RowIDs that represent the "original" non-duplicated rows.
Another way is to create a new table with the same fields and a unique index, then move all the data from the old table to the new one. SQL Server can skip the duplicate values automatically if the index is created with the IGNORE_DUP_KEY option set to ON (with the default, OFF, an insert containing a duplicate fails instead). You end up with the same table without duplicate rows. If you don't want the unique index, you can drop it after transferring the data.
Especially for larger tables, you may use DTS (an SSIS package to import/export data) to transfer all the data rapidly to your new uniquely indexed table. For 7 million rows it takes just a few minutes.
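The new-table approach above could look roughly like this. It is a sketch under assumptions: dbo.MyTable_Dedup and UX_MyTable_Dedup are hypothetical names, and only Col1 and Col3 are used as the key because index keys have a size limit (900 bytes in the SQL Server versions discussed in this thread), which a varchar(2048) column would exceed.

```sql
-- Sketch: let a unique index with IGNORE_DUP_KEY discard duplicates on insert.
CREATE TABLE dbo.MyTable_Dedup (
    Col1 varchar(20) NOT NULL,
    Col3 tinyint     NOT NULL
);

CREATE UNIQUE INDEX UX_MyTable_Dedup
    ON dbo.MyTable_Dedup (Col1, Col3)
    WITH (IGNORE_DUP_KEY = ON);   -- duplicates are skipped with a warning

INSERT INTO dbo.MyTable_Dedup (Col1, Col3)
SELECT Col1, Col3
FROM dbo.MyTable;
```

Which of the duplicate rows survives is not defined, so this suits cases where the duplicates are identical in every column that matters.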
Using the query below we can delete duplicate records based on a single column or multiple columns. The query below deletes based on two columns. The table name is testing and the column names are empno and empname.
DELETE FROM testing WHERE empno not IN (SELECT empno FROM (SELECT empno, ROW_NUMBER() OVER (PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
or empname not in
(select empname from (select empname,row_number() over(PARTITION BY empno ORDER BY empno)
AS [ItemNumber] FROM testing) a WHERE ItemNumber > 1)
Create new blank table with the same structure
Execute query like this
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) > 1
Then execute this query
INSERT INTO tc_category1
SELECT *
FROM tc_category
GROUP BY category_id, application_id
HAVING count(*) = 1
Another way of doing this :--
DELETE A
FROM TABLE A,
TABLE B
WHERE A.COL1 = B.COL1
AND A.COL2 = B.COL2
AND A.UNIQUEFIELD > B.UNIQUEFIELD
I would mention this approach as well as it can be helpful, and works in all SQL servers:
Pretty often there is only one - two duplicates, and Ids and count of duplicates are known. In this case:
SET ROWCOUNT 1 -- or set to number of rows to be deleted
delete from myTable where RowId = DuplicatedID
SET ROWCOUNT 0
From the application level (unfortunately). I agree that the proper way to prevent duplication is at the database level through the use of a unique index, but in SQL Server 2005, an index is allowed to be only 900 bytes, and my varchar(2048) field blows that away.
I dunno how well it would perform, but I think you could write a trigger to enforce this, even if you couldn't do it directly with an index. Something like:
-- given a table stories(story_id int not null primary key, story varchar(max) not null)
CREATE TRIGGER prevent_plagiarism
ON stories
after INSERT, UPDATE
AS
DECLARE @cnt AS INT
SELECT @cnt = Count(*)
FROM stories
INNER JOIN inserted
ON ( stories.story = inserted.story
AND stories.story_id != inserted.story_id )
IF @cnt > 0
BEGIN
RAISERROR('plagiarism detected',16,1)
ROLLBACK TRANSACTION
END
Also, varchar(2048) sounds fishy to me (some things in life are 2048 bytes, but it's pretty uncommon); should it really not be varchar(max)?
DELETE
FROM
table_name T1
WHERE
rowid > (
SELECT
min(rowid)
FROM
table_name T2
WHERE
T1.column_name = T2.column_name
);
CREATE TABLE car(Id int identity(1,1), PersonId int, CarId int)
INSERT INTO car(PersonId,CarId)
VALUES(1,2),(1,3),(1,2),(2,4)
--SELECT * FROM car
;WITH CTE as(
SELECT ROW_NUMBER() over (PARTITION BY personid,carid order by personid,carid) as rn,Id,PersonID,CarId from car)
DELETE FROM car where Id in(SELECT Id FROM CTE WHERE rn>1)
If you want to preview the rows you are about to remove and keep control over which of the duplicate rows to keep, see http://developer.azurewebsites.net/2014/09/better-sql-group-by-find-duplicate-data/
with MYCTE as (
SELECT ROW_NUMBER() OVER (
PARTITION BY DuplicateKey1
,DuplicateKey2 -- optional
ORDER BY CreatedAt -- the first row among duplicates will be kept, other rows will be removed
) RN
FROM MyTable
)
DELETE FROM MYCTE
WHERE RN > 1

SQL - Assign Unique ID for DISTINCT records

I need to create a temp table that has two columns: language_id (number) and language (text). I have a customer table that contains my language column. I need to populate my temp table with distinct records from my language column and I need to be able to assign a language_id for each distinct language record.
I am using 'SELECT DISTINCT Language from CustomerData' to get distinct records, but I am not certain how to assign a language_id for each distinct record.
My desired output is below
Language ID Language
1 English
2 French
3 Spanish
Any help would be much appreciated. Thank you
It's a simple GROUP BY on "Language" in "CustomerData" plus ROW_NUMBER() to assign a distinct row number:
select ROW_NUMBER() OVER( ORDER BY Language asc) as ID, Language
from CustomerData
group by Language
If you want it in a table, set Language_ID as IDENTITY. If you want the result of a query, try this:
SELECT
t.Language
, ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS Language_ID
FROM
(SELECT DISTINCT Language FROM CUSTOMER_DATA) t
You can use the below query to achieve the same
INSERT INTO temp_table
SELECT DISTINCT DENSE_RANK() OVER(ORDER BY [Language] ) Language_ID
,[Language]
FROM CustomerData
ORDER BY Language_ID
Otherwise as suggested in comments you can use Language_ID as IDENTITY column and use the simple query below
INSERT INTO temp_table(Language)
SELECT DISTINCT [Language]
FROM CustomerData
ORDER BY [Language]
SELECT
[Language ID] = ROW_NUMBER() OVER(ORDER BY [Language])
,[Language] = [Language]
FROM
[CustomerData]
GROUP BY
[Language];
You probably have a master / lookup table to get this ID from.
In case you have one, I would recommend using that table and JOINing it in your query to populate the language ID.
In case you don't have a master table for Language,
you can use ROW_NUMBER() to get distinct rows with a LanguageID:
select ROW_NUMBER() OVER (order by Language ASC) as LanguageID, Language from
(
SELECT DISTINCT Language from CustomerData
) t
Create the temporary table:
CREATE TABLE ##TempLanguage (
[LanguageId] [int] IDENTITY(1,1) NOT NULL, -- note use of IDENTITY field to assign LanguageId value
[Language] [nvarchar](100) NOT NULL,
) ON [PRIMARY]
Load the temporary table:
INSERT ##TempLanguage (
[Language]
)
SELECT DISTINCT [Language] FROM [CustomerData] ORDER BY [Language];
Display the temporary table's contents:
SELECT [LanguageId], [Language] FROM ##TempLanguage;
Remove the temporary table:
DROP TABLE ##TempLanguage
Note that beginning the name of a temporary table with a double hash sign, such as ##TempLanguage, creates a global temporary table that is visible to all database connections and is dropped once the connection that created it closes and no other connection is still using it. Beginning the name of a temporary table with a single hash sign, such as #TempLanguage, creates a temporary table that persists for the length of the current connection and is visible only to the current database connection.
In addition to all the existing answers, you can add an identity column to your table afterwards.
ALTER TABLE <table-name> ADD
LanguageId int IDENTITY(1, 1) NOT NULL
This will assign an incremental id to all rows currently in the table when the column is added.

How to remove one record so my unique key constraint won't break in the future

I have a table, Core_Faculty with 4 fields: ID (PK - INT), InstitutionID (INT), PersonID (INT), DeprecatedDate (SMALLDATETIME)
What I'd like to do is delete all the records for institution/person combinations that have both deprecated records and non-deprecated (DeprecatedDate IS NULL) record, but keep the non-deprecated record.
If an institution/person combination has just one record (whether deprecated or not), I'd like to keep it and leave it alone. I'm only considering records that have both DeprecatedDate IS NULL and DeprecatedDate IS NOT NULL for each unique institution/person combination.
End goal is to be left with one record per institution/person combination whether deprecated or not, but giving priority to the record that has a NULL deprecated date. These are the good, live records. However, if we are starting with only one record and it's deprecated, go ahead and keep it.
The database currently only can potentially have one of each as institution/person/deprecateddate is a unique key on the table.
How would I go about solving this, and what methods can I use to find the appropriate records, while only considering records that have both deprecated and non-deprecated values for the combination?
DELETE f
FROM
Core_Faculty f
INNER JOIN
(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
f.InstitutionID,
f.PersonID
ORDER BY
CASE
WHEN f.DeprecatedDate IS NULL THEN 1
ELSE 2
END,
f.DeprecatedDate
) RowNum
FROM
Core_Faculty f
) d ON
f.ID = d.ID
WHERE
d.RowNum > 1;
In SQL Server you can use a common table expression with a ROW_NUMBER function to identify the rows you want to keep:
WITH cte AS (
SELECT [ID]
,[InstitutionID]
,[PersonID]
,[DeprecatedDate]
,ROW_NUMBER() OVER (PARTITION BY [InstitutionID], [PersonID]
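ORDER BY CASE WHEN [DeprecatedDate] IS NULL THEN 0 ELSE 1 END,
[DeprecatedDate] DESC) as [RowNumber]
FROM [Blog].[dbo].[Core_Faculty]
)
SELECT [ID]
,[InstitutionID]
,[PersonID]
,[DeprecatedDate]
,[RowNumber]
FROM cte
--WHERE [RowNumber] = 1
The ORDER BY puts the non-deprecated record (DeprecatedDate IS NULL) first in each [InstitutionID], [PersonID] grouping, followed by the deprecated records from latest to earliest. The CASE expression is needed because SQL Server sorts NULL as the lowest value, so a plain ORDER BY [DeprecatedDate] DESC would rank the NULL row last. If there is only one row, whether deprecated or not, it is kept since it is the 1st row in the grouping.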
You can then use
DELETE
FROM cte
WHERE [RowNumber] > 1
instead of the select to remove the rest of the rows, leaving you with just one row per person/institution combination.
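Before deleting anything, it may help to list the combinations the question cares about: those with both deprecated and non-deprecated rows. A minimal sketch, assuming the Core_Faculty columns described in the question (the DeprecatedRows and LiveRows aliases are made up for illustration):

```sql
-- Preview: institution/person combinations that have at least one
-- deprecated row AND at least one non-deprecated (NULL) row.
SELECT InstitutionID,
       PersonID,
       COUNT(DeprecatedDate)            AS DeprecatedRows,  -- non-NULL dates
       COUNT(*) - COUNT(DeprecatedDate) AS LiveRows         -- NULL dates
FROM Core_Faculty
GROUP BY InstitutionID, PersonID
HAVING COUNT(DeprecatedDate) > 0
   AND COUNT(*) > COUNT(DeprecatedDate);
```

COUNT(DeprecatedDate) ignores NULLs, which is what lets a single GROUP BY tell the two kinds of row apart.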
