I have an Access database table which sometimes contains duplicate ProfileIDs. I would like to create a query that excludes one (or more, if necessary) of the duplicate records.
The condition for excluding a duplicate record is: if its PriceBefore and PriceAfter fields are NOT equal, that record is the one to exclude. If they are equal, the duplicate record remains.
In the example table above, the records with ID 7 and 8 have the same ProfileIDs. For ID 8, PriceBefore and PriceAfter are not equal, so this record should be excluded from the query. For ID 7, the two are equal, so it remains. Also note that PriceBefore and PriceAfter for ID 4 are the same, but as the ProfileID is not a duplicate, the record must remain.
What is the best way to do this? I am happy to use multiple queries if necessary.
Create a pointer query. Call it pQuery:
SELECT ProfileID, Sum(1) AS X
FROM MyTableName
GROUP BY ProfileID
HAVING Sum(1) > 1
This will give you the ProfileID of every record that's part of a dupe.
Next, find the records where prices don't match. Call this pNoMatchQuery:
SELECT MyTableName.*
FROM MyTableName
INNER JOIN pQuery
ON pQuery.ProfileID = MyTableName.ProfileID
WHERE PriceBefore <> PriceAfter
You now have a query of every record that should be excluded from your dataset. If you want to permanently delete all of these records, run a DELETE query that matches your source table against pNoMatchQuery:
DELETE MyTableName.*
FROM MyTableName
WHERE EXISTS (SELECT 1 FROM pNoMatchQuery WHERE pNoMatchQuery.ID = MyTableName.ID)
First, make absolutely sure that pQuery and pNoMatchQuery are returning what you expect before you delete anything from your source table, because once it's gone it's gone for good (unless you make a backup first, which I would highly suggest before you run that delete the first time).
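Making that backup can be as simple as a make-table query; something like this should work in Access (MyTableName_Backup is just an illustrative name):
SELECT MyTableName.* INTO MyTableName_Backup
FROM MyTableName;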
I have a table like this:
As you can see, rows 2 and 3 are similar, and row 3 is a useless duplicate. My question is: how can I delete row 3 only, while keeping rows 2 and 4?
Like this:
Thanks for your help!
You don't have duplicates. If you had a heap table with truly identical records, every value in two or more records would be the same. One way of dealing with that is to add an identity column; the identity column can then be used to remove some, but not all, of the duplicates.
In your case, you want to delete records if another record exists that is similar and perhaps has "better" data. You can use an EXISTS clause to do this. The logic below is not what you want, but it should give you the idea of how to handle this.
DELETE t
FROM MyTable t
WHERE t.BCT IS NULL -- delete only records with no values?
AND t.BCS IS NULL
AND EXISTS( -- another record with a value exists, so this one might not be needed?
SELECT *
FROM MyTable x
WHERE (x.BCT IS NOT NULL OR x.BCS IS NOT NULL)
AND x.portCode = t.portCode
AND x.effDate = t.effDate
AND LEFT(x.issueName, 26) = LEFT(t.issueName, 26)
)
I am trying to delete millions of records from 4 databases, and running into an unexpected error. I made a temp table that holds a list of all the id's I wish to delete:
CREATE TABLE #CaseList (case_id int)
INSERT INTO #CaseList
SELECT DISTINCT id
FROM my_table
WHERE <my criteria for choosing cases>
I have deleted all the associated records (with foreign key on case_id)
DELETE FROM image WHERE case_id in (SELECT case_id from #CaseList)
Then I'm deleting records from my_table in batches (so as not to blow up the transaction log, which still grows when making changes like deletions even though my database is in Simple mode):
DELETE FROM my_table WHERE id in (SELECT case_id
FROM #CaseList
ORDER by case_id
OFFSET 0 ROWS
FETCH NEXT 10000 ROWS ONLY)
This will work fine for one or three or five rounds (so I've deleted 10k-50k records), then will fail with this error message:
Msg 512, Level 16, State 1, Procedure trgd_image, Line 188
Subquery returned more than 1 value. This is not permitted when the subquery follows =, !=, <, <= , >, >= or when the subquery is used as an expression.
Which is really weird because as I said, I already deleted all the associated records from the image table. Then it gets weirder because if I select smaller batches, the deletion works without error.
I generally cut the FETCH NEXT n in half (5k), then in half again (2500), then in half again (1200), etc., until it works:
DELETE FROM my_table WHERE id in (SELECT case_id
FROM #CaseList
ORDER by case_id
OFFSET 50000 ROWS
FETCH NEXT 1200 ROWS ONLY)
Then repeat that amount until I get past where it failed, then turn it back up to 10000 and it will work again for a batch or three...
DELETE FROM my_table WHERE id in (SELECT case_id
FROM #CaseList
ORDER by case_id
OFFSET 60000 ROWS
FETCH NEXT 10000 ROWS ONLY)
then fail again with the same error... rinse, wash, and repeat.
What can cause that subquery error when there are NOT related records in the image table? Why would selecting the cases in smaller batches work "around it" and then allow larger batches again?
I would really love a solution to this so I can make a WHILE loop and run this deletion through the millions of rows that way instead of having to manage it manually which is going to take me weeks with millions of rows needed to be deleted out of 4 databases.
The query you're showing cannot produce the error you're seeing. If you're sure it does, you have a bug report. My guess is that at trgd_image, Line 188 (or somewhere nearby), you'll find you're using a scalar comparison, = instead of IN.
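For illustration only (I can't see trgd_image, so the table name below is made up), the single-row assumption that produces Msg 512 in a delete trigger usually looks like the first statement here, and the set-based rewrite like the second:
-- Fails with Msg 512 once a batch delete puts more than one row in "deleted"
DECLARE @case_id int;
SET @case_id = (SELECT case_id FROM deleted);

-- Set-based version that handles any number of rows
DELETE a
FROM audit_rows a          -- hypothetical table the trigger maintains
WHERE a.case_id IN (SELECT case_id FROM deleted);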
I also have some advice for you, free for the asking. I wrote lots of queries like yours, and never used anything like OFFSET 60000 ROWS FETCH NEXT 10000 ROWS ONLY. You don't need to, either, and your SQL will be easier to write if you don't.
First, unless your machine is seriously undersized in 2018 for the scale of data you're using, I think you'll find 100,000-row transactions are just fine. If not, at least try to understand why not. A machine managing many millions of rows ought to be able to deal with 1% of them without breaking a sweat.
When you populate #CaseList, trap @@ROWCOUNT. Then you can print/record that, and compute the number of "chunks" in your work.
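Something like this is all I mean (the variable name and message text are just examples):
DECLARE @total int;

INSERT INTO #CaseList (case_id)
SELECT DISTINCT id
FROM my_table
WHERE <my criteria for choosing cases>;

SELECT @total = @@ROWCOUNT;
PRINT 'cases to delete: ' + CAST(@total AS varchar(20))
    + ', 10000-row chunks: ' + CAST((@total + 9999) / 10000 AS varchar(20));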
Ideally, though, there's no temporary table. Instead, those cases probably have some logical grouping you can operate on. They might have regions or owners or dates, whatever was used to select them in the first place. Iterate over that, e.g.
delete from T where id in (select id from S where user = 1)
Once you do that, you can write a loop:
declare @user int, @u int

select @user = min(user) from S where ...
while @user is not null begin
    print 'deleting cases for user ' + cast(@user as varchar(20))
    delete from T where id in (select id from S where user = @user)
    select @u = @user
    select @user = min(user) from S where ... and user > @u
end
That way, if the process blows up partway through -- for any reason -- you have a logical grouping of deletions and a clean break: you know all the cases for user (or whatever) less than @user are deleted, and you can look into what's wrong with the "current" one. Quite often, you'll discover that the problem isn't unique, and by solving it you'll prevent future problems with others.
I have two tables in two databases. One is in a "primary" database, the second is in an "archive" database. The structures are identical, only the data is different. The issue is the primary table has records that are duplicated in the archive table. I need to remove the duplicates from the primary table before archiving records to the archive table. The first attempt seemed to work well, but the activity log filled up, killing the process because there was simply too much going on. The DBAs told me to delete in chunks of 100,000.
I also found out that there may be differences in the records even though the primary keys match between the databases. These records need to be saved, not deleted as duplicates. This led me to the INTERSECT and EXCEPT clauses. I've tried these three samples, and even though each INTERSECT returns 100,000 records, the DELETE only deletes 94,000 to 95,000 records. I included some comments for each. The last one deletes everything from the primary database table.
-- Test #1: Doesn't delete 100,000 - around 94,000 or 95,000
DELETE TOP (100000) FROM [db_1].[table_1]
WHERE EXISTS (
-- Returns 100,000
SELECT * FROM [db_1].[table_1]
INTERSECT
SELECT * FROM [db_2].[table_1]
)
-- Test #2: Doesn't delete 100,000 - around 94,000 or 95,000
-- [test_data] returns 100,000 records
;WITH [test_data] AS (
-- Returns 100,000
SELECT TOP (100000) *
FROM (
SELECT * FROM [db_1].[table_1]
INTERSECT
SELECT * FROM [db_2].[table_1]
) [meh]
)
DELETE TOP (100000) FROM [db_1].[table_1]
WHERE EXISTS (
SELECT * FROM [test_data]
)
-- Test #3: Deletes everything from [db_1].[table_1]
-- The EXISTS clause returns 100,000
-- The DELETE statement deletes everything from [db_1].[table_1]
DELETE FROM [db_1].[table_1]
WHERE EXISTS (
SELECT TOP (100000) [meh].[address_rid]
,[meh].[REV]
FROM (
-- Returns 100,000
SELECT * FROM [db_1].[table_1]
INTERSECT
SELECT * FROM [db_2].[table_1]
) [meh]
)
Is this enough information, or should I add more? Thanks in advance...
* EDIT *
Crap! OK, I appreciate the feedback and will try to be more thorough in my explanation. I need to delete records that are duplicated between the two database tables. With the requirement to check each record to make sure it is a true duplicate (i.e., no data changed while in the primary database), I found INTERSECT. What I understand is that it does a field-to-field comparison, verifying whether or not a record is a true duplicate between the database tables. Once that has been established, I want to delete the duplicated record from the primary database, leaving the other record in the archive database...
Primary database:           Archive database:
table_1:                    table_1:
PK  f_1  f_2  f_3           PK  f_1  f_2  f_3
1   A    B    B             1   A    B    B
2   B    D    X             2   B    D    X
3   C    G    Y             13  Q    M    O
4   D    J    Z             14  S    M    K
In the above table sets, the duplication is with records 1 and 2. Since they exist in both tables, I need to delete them from the primary database table. What I'm running into is that some tables have millions of records, and they need to be deleted in 100,000-record chunks, per our DBAs. My thinking was that if I put the INTERSECT in the EXISTS clause, limiting the results of the INTERSECT to 100,000 records, the DELETE would delete that 100,000-record chunk. But it doesn't. It deletes maybe 94,000 to 95,000 records, then as it nears the end of the deletions, it starts deleting a few thousand at a time, winding down to a handful of records.
If I'm using it incorrectly, or if there's a better way to delete the duplicated records, I'm all ears. As several of you have pointed out, I have a difficult time asking questions correctly, which is why I don't ask very many anymore. I do appreciate any suggestions, comments, advice, and/or critiques. I'm not so proud as to reject any help. Unless it's a verbal attack on my dog. Then I might get upset! ;-) I hope this helps. If not, I'm sure I'll hear about it.
I have 2 tables in 2 databases. The schema for the tables is identical; only the data is different. There are no timestamps or last-updated information. Table A is a live table, that is, it's updated in "the" program. Update records, insert records and delete records all happen in Table A. Table B is a backup made weekly. Is there a quick way to compare the 2 tables and give me results similar to:
I | 54
D | 55
U | 60
So record 54 in the live table is new, record 55 in the live table was deleted, record 60 in the live table was updated.
This needs to work in SQL Server 2008 and up.
Fields: id, first_name, last_name, phone, email, address_id, birth_date, last_visit, provider_id, comments
I have no control over the schema. I have read-only access to Table A, and read-write access to Table B.
Would it be easier to store a hash of each of Table A's rows rather than a full copy of the table? Generally speaking, I need to know which rows have been updated/inserted and deleted without a built-in timestamp. I have the weekly backup table to look at, but I could create a hash table if needed.
Use two full joins: the first is used just to check for id existence and identify inserts and deletes; the second is used for row equality.
In the example I have used CHECKSUM for simplicity, but I recommend you read up on the cons of using it and consider alternatives like HASHBYTES, or checking each column for equality.
-- Snapshot each table as id + per-row checksum
SELECT id, CHECKSUM(*) AS hash
INTO #live
FROM live.dbo.tbl

SELECT id, CHECKSUM(*) AS hash
INTO #archive
FROM archive.dbo.tbl

-- a1 matches on id only (inserts/deletes), a2 matches on id + hash (updates)
SELECT COALESCE(l1.id, a1.id) AS id,
       CASE WHEN l1.id IS NULL THEN 'D'
            WHEN a1.id IS NULL THEN 'I'
            WHEN a2.id IS NULL THEN 'U' END AS change_type
FROM #live l1
FULL JOIN #archive a1 ON a1.id = l1.id
FULL JOIN #archive a2 ON a2.id = l1.id
                     AND a2.hash = l1.hash
WHERE NOT (l1.id IS NULL AND a1.id IS NULL)               -- ignore rows produced only by the second join
  AND (l1.id IS NULL OR a1.id IS NULL OR a2.id IS NULL)   -- keep only inserts, deletes, updates
I'm going to recommend a tool, but it's not free, although it has a fully functioning 30 day trial period. If you're going to compare data in SQL Server tables, look at Red Gate's SQL Data Compare. It's not cheap, and it will pay for itself many times over. (If you need to compare schemas, their SQL Compare does that.)
Barring that, you could populate a third table with a compare query: select the rows in one table and not the other (with a field indicating which), the rows in the other table and not the first, and then compare field by field to find the ones that differ -- that should work too. It will take longer, but if it's just one table, the time it takes to write that code should be less than what you'll pay for the Red Gate tools.
If there is a column or set of columns that can uniquely identify each row, then a series of SQL statements could be written to identify the inserts, updates and deletes. If there isn't a unique row identifier, or the unique identifier (for example, one of the columns that makes it unique) changes, then no.
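A rough sketch of that series of statements, assuming id is the unique identifier and TableA/TableB stand in for the live and backup tables:
-- Inserted: in live but not in backup
SELECT 'I' AS change_type, a.id
FROM TableA a
WHERE NOT EXISTS (SELECT 1 FROM TableB b WHERE b.id = a.id)

-- Deleted: in backup but not in live
SELECT 'D' AS change_type, b.id
FROM TableB b
WHERE NOT EXISTS (SELECT 1 FROM TableA a WHERE a.id = b.id)

-- Updated: in both, but at least one column differs
SELECT 'U' AS change_type, a.id
FROM TableA a
JOIN TableB b ON b.id = a.id
WHERE a.first_name <> b.first_name
   OR a.last_name  <> b.last_name
   OR a.phone      <> b.phone        -- and so on for the remaining columns,
                                     -- with IS NULL handling if columns are nullable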
Does anyone know how I can rewrite the script below in a new way to improve performance when deleting duplicate rows?
DELETE lt1 FROM #listingsTemp lt1, #listingsTemp lt2
WHERE lt1.code = lt2.code and lt1.classification_id > lt2.classification_id and (lt1.fap < lt2.fap or lt1.fap = lt2.fap)
Delete duplicate rows in a SQL table:
delete table_a
where rowid not in
(select min(rowid) from table_a
group by column1, column2);
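Note that rowid is Oracle-specific. Against the temp table in the question, the same keep-one-per-group idea can be written in SQL Server with ROW_NUMBER; the PARTITION BY / ORDER BY columns below are my guess at what defines the duplicate and which row to keep:
;WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY code
                              ORDER BY classification_id, fap) AS rn
    FROM #listingsTemp
)
DELETE FROM ranked
WHERE rn > 1;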
1 - Create an Identity Column (ID) for your table (t1)
2 - Do a Group by on your table with your conditions and get IDs of duplicated records.
3 - Now, simply delete the records from t1 whose IDs are in that set of duplicate IDs (see the sketch below).
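A minimal sketch of those three steps, with t1 and the GROUP BY columns standing in for your real table and conditions:
-- 1: add an identity column
ALTER TABLE t1 ADD ID int IDENTITY(1,1);

-- 2 + 3: keep the lowest ID in each duplicate group, delete the rest
DELETE FROM t1
WHERE ID NOT IN (
    SELECT MIN(ID)
    FROM t1
    GROUP BY code, classification_id   -- your duplicate-defining conditions
);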
Look into BINARY_CHECKSUM. You could possibly use it when creating your temp tables to determine more quickly whether the data is the same: for example, create a new field in both temp tables storing the BINARY_CHECKSUM value, then just delete where those fields are equal.
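Something like this is what that suggestion amounts to (a sketch only; the source tables and the code column are assumptions):
SELECT *, BINARY_CHECKSUM(*) AS bc INTO #listingsTemp1 FROM sourceTable1;
SELECT *, BINARY_CHECKSUM(*) AS bc INTO #listingsTemp2 FROM sourceTable2;

DELETE lt1
FROM #listingsTemp1 lt1
JOIN #listingsTemp2 lt2
  ON lt2.code = lt1.code
 AND lt2.bc   = lt1.bc;    -- matching checksums are treated as duplicate rows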
The odiseh answer seems to be valid (+1), but if for some reason you can't alter the structure of the table (because you don't have the code of the applications that use it, or something like that), you could write a job that runs every night and deletes the duplicates (using the Moayad Mardini code).