How to get rid of duplicates with T-SQL - sql-server

Hi, I have a login table that has some duplicated usernames.
Yes, I know I should have put a constraint on it, but it's a bit too late for that now!
So essentially what I want to do is first identify the duplicates. I can't just delete them, since I can't be sure which account is the correct one: the accounts share a username and hold roughly the same information, with a few small variances.
Is there any way to efficiently script this so that I can append "_duplicate" to all but one of the accounts for each duplicated username?

You can use ROW_NUMBER with a PARTITION BY in the OVER() clause to find the duplicates and an updateable CTE to change the values accordingly:
DECLARE @dummyTable TABLE(ID INT IDENTITY, UserName VARCHAR(100));
INSERT INTO @dummyTable VALUES('Peter'),('Tom'),('Jane'),('Victoria')
                             ,('Peter'),('Jane')
                             ,('Peter');

WITH UpdateableCTE AS
(
    SELECT t.UserName AS OldValue
          ,t.UserName + CASE WHEN ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY ID) = 1
                             THEN '' ELSE '_duplicate' END AS NewValue
    FROM @dummyTable AS t
)
UPDATE UpdateableCTE SET OldValue = NewValue;

SELECT * FROM @dummyTable;
The result
ID UserName
1 Peter
2 Tom
3 Jane
4 Victoria
5 Peter_duplicate
6 Jane_duplicate
7 Peter_duplicate
You might include ROW_NUMBER() as another column to find each duplicate's ordinal (see the sketch below). If you've got a sort clause that numbers the earliest (or most current) row with 1, it should be easy to find and correct the duplicates.
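For example, a minimal sketch of that inspection query, reusing the @dummyTable data from above:

-- Every row with DuplicateOrdinal > 1 is a duplicate of an earlier row.
SELECT ID
      ,UserName
      ,ROW_NUMBER() OVER(PARTITION BY UserName ORDER BY ID) AS DuplicateOrdinal
FROM @dummyTable
ORDER BY UserName, DuplicateOrdinal;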
Once you've cleaned up this mess, you should make sure you don't get new dups. But you know this already :-D

There is no easy way to get rid of this nightmare; some manual action is required.
First, identify the duplicates:
select * from dbo.users
where username in
    (select username from dbo.users
     group by username
     having count(*) > 1)
Next, identify "useless" users (for example, those who registered but never placed an order).
Rerun the query above. Out of this list, find duplicates that really are the same person (by email, for example) and combine them into a single record. If they did something useful previously (for example, placed orders), first reassign those orders to the user that survives, as sketched below, then remove the others.
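A minimal sketch of that hand-over, assuming a hypothetical dbo.orders table; the two IDs are values you picked after manual review:

DECLARE @survivorId INT = 42, @duplicateId INT = 99; -- chosen manually

-- hand the duplicate's orders to the surviving account
UPDATE dbo.orders SET userId = @survivorId WHERE userId = @duplicateId;

-- then remove the duplicate account
DELETE FROM dbo.users WHERE userId = @duplicateId;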
Continue with other criteria until you get rid of the duplicates.
Then set a unique constraint on the username field. It is also a good idea to set a unique constraint on the email field.
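For example (a sketch; the constraint names are assumptions):

ALTER TABLE dbo.users ADD CONSTRAINT UQ_users_username UNIQUE (username);
ALTER TABLE dbo.users ADD CONSTRAINT UQ_users_email UNIQUE (email);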
Again, it is not easy and not automatic.

In this case, where the duplicates and the original rows show some variance, it is practically impossible to select the non-duplicate rows automatically, since you cannot tell which record is real and which is the duplicate.
I think the best thing to do is to correct your data manually and then fix whatever process is producing these slightly varying duplicates.

Related

Highlight Duplicate Values in a NetSuite Saved Search

I am looking for a way to highlight duplicates in a NetSuite saved search. The duplicates are in a column called "ACCOUNT" populated with text values.
NetSuite permits adding fields (columns) to the search using a stripped-down version of SQL. It also permits conditional highlighting of entire rows using the same code. However, I don't see an obvious way to compare values between rows of data.
Although duplicates can be grouped together in a summary report and identified by a count of 2 or more, I want to show duplicate lines separately and highlight each.
The closest thing I found was a clever formula that calculates a running total here:
sum/* comment */({amount})
OVER(PARTITION BY {name}
ORDER BY {internalid}
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
I wonder if it's possible to sort results by the field being checked for duplicates and adapt this code to identify changes in the "ACCOUNT" field between a row and the previous row.
Any ideas? Thanks!
This post has been edited. I have left the progression as a learning experience about NetSuite.
Original - plain SQL way - not suitable for NetSuite
Does something like this meet your needs? The test data assumes looking for duplicates on id1 and id2. Note: this does not work in NetSuite, as NetSuite supports only a limited set of SQL functions.
declare @table table (id1 int, id2 int, value int);
insert @table values
    (1,1,11),
    (1,2,12),
    (1,3,13),
    (2,1,21),
    (2,2,22),
    (2,3,23),
    (1,3,1313);

--select * from @table order by id1, id2;

select t.*,
       case when dups.id1 is not null then 1 else 0 end as is_dup --flag rows that have a matching dup record
from @table t
left join ( --subquery to find duplicates
    select id1, id2
    from @table
    group by id1, id2
    having count(1) > 1
) dups
    on dups.id1 = t.id1
    and dups.id2 = t.id2
order by t.id1, t.id2;
First Edit - NetSuite target but in SQL.
This was a SQL test based on the example syntax provided in the question, since I do not have NetSuite to test against. It will give you a value greater than 1 on each duplicate row using a similar syntax. Note: this gives the appropriate answer, but not in NetSuite.
select t.*,
       sum(1) over (partition by id1, id2) as dup_count
from @table t
order by t.id1, t.id2;
Second Edit - Working NetSuite version
After some back and forth here is a version that works in NetSuite:
sum/* comment */(1) OVER(PARTITION BY {name})
This will also give a value greater than 1 on any row that is a duplicate.
Explanation
This works by summing the value 1 for each row included in the partition. The partition column(s) should be whatever you consider a duplicate. If a single column makes a duplicate (e.g. user ID), use it as above. If multiple columns make a duplicate (e.g. first name, last name, city), use a comma-separated list in the partition, as sketched below. SQL basically groups the rows by the partition and adds up the 1s in the sum/* comment */(1). The example provided in the question sums an actual column; by summing 1 instead, we get the value 1 when there is only one row in the partition. Anything higher is a duplicate. I guess you could call this field the duplicate count.
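For instance, a sketch of the multi-column form; the {firstname}, {lastname} and {city} field IDs are placeholders for whatever fields define a duplicate in your search:

sum/* comment */(1) OVER(PARTITION BY {firstname}, {lastname}, {city})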

Access: Removing a duplicate if two fields are the same

I have an Access database table which sometimes contains duplicate ProfileIDs. I would like to create a query that excludes one (or more, if necessary) of the duplicate records.
The condition for a duplicate record to be excluded is: if its PriceBefore and PriceAfter fields are NOT equal, it is considered a duplicate and excluded. If they are equal, the duplicate record remains.
In the example table above, the records with ID 7 and 8 have the same ProfileIDs. For ID 8, PriceBefore and PriceAfter are not equal, so this record should be excluded from the query. For ID 7, the two are equal, so it remains. Also note that PriceBefore and PriceAfter for ID 4 are the same, but as the ProfileID is not a duplicate, the record must remain.
What is the best way to do this? I am happy to use multiple queries if necessary.
Create a pointer query. Call it pQuery:
SELECT ProfileID, Sum(1) as X
FROM MyTableName
GROUP BY ProfileID
HAVING Sum(1) > 1
This will give you the ProfileID of every record that's part of a dupe.
Next, find the records where prices don't match. Call this pNoMatchQuery:
SELECT MyTableName.*
FROM MyTableName
INNER JOIN pQuery
ON pQuery.ProfileID = MyTableName.ProfileID
WHERE PriceBefore <> PriceAfter
You now have a query of every record that should be excluded from your dataset. If you want to permanently delete all of these records, run a DELETE query where you inner join your source table to pNoMatchQuery:
Delete MyTableName.*
From MyTableName
Where Exists( Select 1 From pNoMatchQuery Where pNoMatchQuery.ID = MyTableName.ID ) = True
First, make absolutely sure that pQuery and pNoMatchQuery are returning what you expect before you delete anything from your source table. Once it's gone, it's gone for good, so make a backup before you run that delete for the first time.
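If you would rather just exclude the duplicates in a query than delete them, a minimal sketch using Access's unmatched-records pattern (same table and query names as above):

SELECT MyTableName.*
FROM MyTableName
LEFT JOIN pNoMatchQuery
ON pNoMatchQuery.ID = MyTableName.ID
WHERE pNoMatchQuery.ID Is Null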

Checking existence of a dynamic value before update/insert

I am trying to mass update a table column with values, but I need the query to check whether each value already exists. If it does, the query should make the relevant changes before checking again and updating the table.
The database primarily holds staff information, and I need to create a unique username. The script to create the username is:
select upper(LEFT(first_name,1))+LEFT(surname,3)+'1'
from staff_test
If this was used for an example user, it would generate a username of ABit1 for user Andrew Bithell. What I need it to do is check whether there already is an ABit1 username in the STAFF_TEST table and, if so, change Andrew's username to ABit2, as the usernames have to be unique, before it moves on to the next user.
I have created another table which lists all the current usernames, split into 2 columns, so they display in this table as:

column1 | column2
--------+--------
ABit    | 1
I have experimented with a function and I am now thinking a Merge statement might be the way to go.
Any suggestions are welcomed.
Using ROW_NUMBER() can generate all the unique names at once:
select
    upper(LEFT(first_name,1)) + LEFT(surname,3) +
        cast(row_number() over (partition by upper(LEFT(first_name,1)) + LEFT(surname,3)
                                order by surname, first_name) as varchar(10)) as username
    ,first_name
    ,surname
from staff_test
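To actually write the generated names back, here is a minimal sketch using an updateable CTE; this assumes staff_test has a username column to populate, and the ORDER BY decides which user keeps the lowest number:

with numbered as
(
    select username
          ,upper(LEFT(first_name,1)) + LEFT(surname,3) as base_name
          ,row_number() over (partition by upper(LEFT(first_name,1)) + LEFT(surname,3)
                              order by surname, first_name) as n
    from staff_test
)
update numbered
set username = base_name + cast(n as varchar(10));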
Perform an up front check to see if there are any clashes:
SELECT UPPER(LEFT(first_name, 1)) + LEFT(surname, 3) + '1' AS username ,
COUNT(1) counter
FROM staff_test
GROUP BY UPPER(LEFT(first_name, 1)) + LEFT(surname, 3) + '1'
HAVING COUNT(1) > 1
ORDER BY COUNT(1) DESC
This will return each username in your staff table, grouped by username, along with a count of how many occurrences there are of each.
You can sanitize the data if that's what you're looking to do; otherwise, I would suggest appending an ID column value, or some other value that is unique per record, instead of 1 on the end, as sketched below.
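A minimal sketch of that suggestion, assuming a hypothetical staff_id identity column on staff_test (the column name is a placeholder):

SELECT UPPER(LEFT(first_name, 1)) + LEFT(surname, 3)
       + CAST(staff_id AS VARCHAR(10)) AS username
FROM staff_test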

MSSql Batch Update of table on Primary Keys

I have migrated an Access DB to MS SQL Server 2008 and found some anomalies in the old database. In both DBs, IDs are auto-incremental and should be in line with DateOfTransaction. But as shown below, some have been saved in the wrong chronological order.
Access:
ID FileID DateOfTransaction SectionID
64490 95900 02/12/1997 100
64491 95900 04/04/1996 80
64492 95900 25/03/1996 90
Desired Correct Format:
ID FileID DateOfTransaction SectionID
64492 95900 02/12/1997 100
64491 95900 04/04/1996 80
64490 95900 25/03/1996 90
The PK (ID) table is linked to several other tables with update Cascade set.
I need to group by FileID and sort by DateOfTransaction and update IDs accordingly.
I need some suggestions on how best to tackle this as data is quite sensitive. I have about 50K records to update.
Thanks for reading!
Try this query:
with cte as
(
    select *
    from (
        select *, ROW_NUMBER() over (partition by FileID
                                     order by DateOfTransaction) as row_num
        from t_Transactions
    ) A
    join (
        select ID as B_ID, FileID as B_FileID,
               ROW_NUMBER() over (partition by FileID order by ID) as B_row_num
        from t_Transactions
    ) B
        on A.FileID = B.B_FileID      -- match rows within the same FileID group
        and A.row_num = B.B_row_num   -- pair the nth row by date with the nth ID
)
select T.ID as [Old_ID], cte.B_ID as [New_ID],
       T.FileID, T.DateOfTransaction, T.SectionID
--update T set T.ID = cte.B_ID
from t_Transactions T
join cte
    on T.ID = cte.ID
    and cte.B_FileID = T.FileID
Before updating, you can run the SELECT and confirm the result.
This query updates the table as per your requirement. You have mentioned that the ID column is linked to several other tables. Please be very careful about this, and make sure that updating the ID column doesn't break anything else.
Designing a database to rely on the order of an artificially-generated key to match the date order of another column is a terrible anti-pattern, NOT best practice in the slightest.
Stop relying on it to represent insertion order. That is the answer. If you need that data, it should be another column separate from your PK. Can't you order by date, anyway? If not, create a new column.
It is always a mistake to invest internal database identifiers with meaning of any kind besides relating rows to each other.
I've seen this exact problem before at a former employer--and the database was rife with all sorts of other design problems as well. FK columns were actually named "frnkeyColumnName" to match the "keyColumnName" they pointed to. Never mind a PK that was also an FK...
Stop the madness!
I would seriously consider whether you need to do this at all. Is there any logic that depends on higher IDs having a later date? If the data was already out of order in the Access database, it doesn't matter.
If you do decide to proceed, back up the data first, as sketched below. You're probably going to make mistakes.
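A minimal sketch of that safety net; the backup table name is an assumption:

-- keep a copy you can restore from if the renumbering goes wrong
SELECT * INTO t_Transactions_backup FROM t_Transactions;

-- run the update inside a transaction so you can inspect before committing
BEGIN TRANSACTION;
-- ... run the UPDATE from the first answer here, then check the results ...
-- COMMIT TRANSACTION;   -- if the data looks right
-- ROLLBACK TRANSACTION; -- if it doesn't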

Generate Unique hash for a field in SQL Server

I'm in the process of writing a membership provider for use with our existing membership base. I use EF 4.1 for all of my database access, and one of the issues I'm running into is that when the DB was originally set up, the relationships were done programmatically instead of in the DB. One of the relationships needs to be made on a column that isn't required for all of our users, but in order to make the relationship it does need to be unique (from my understanding).
My solution, which I believe will work, is to do an MD5 hash on the userid field (which is unique, and so would/should guarantee a unique value in that field). The part that I'm having issues with on SQL Server is the query that would do this WITHOUT replacing the existing values stored in the employeeNum field (the one in question).
So, in a nutshell, my question is: what is the best way to get a unique value into the employeeNum field (possibly based on an MD5 hash of the userid field) for all the rows in which a value isn't already present? Also, to a minor/major extent: does this sound like a good plan?
If your question is just how to generate a hash value for userid, you can do it this way using a computed column (or generate this value as part of the insert process). It isn't clear to me whether you know about the HASHBYTES function or what other criteria you're looking at when you say "best."
DECLARE @foo TABLE
(
    userid INT,
    hash1 AS HASHBYTES('MD5',  CONVERT(VARCHAR(12), userid)),
    hash2 AS HASHBYTES('SHA1', CONVERT(VARCHAR(12), userid))
);

INSERT @foo(userid) SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 500;

SELECT userid, hash1, hash2 FROM @foo;
Results:
userid hash1 hash2
------ ---------------------------------- ------------------------------------------
1 0xC4CA4238A0B923820DCC509A6F75849B 0x356A192B7913B04C54574D18C28D46E6395428AB
2 0xC81E728D9D4C2F636F067F89CC14862C 0xDA4B9237BACCCDF19C0760CAB7AEC4A8359010B0
500 0xCEE631121C2EC9232F3A2F028AD5C89B 0xF83A383C0FA81F295D057F8F5ED0BA4610947817
In SQL Server 2012, I highly recommend at least SHA2_256 instead of either of the above. (You forgot to mention what version you're using - always useful information.)
All that said, I still want to call attention to the point I made in the comments: the "best" solution here is to fix the model. If employeeNum is optional, EF shouldn't be made to think it is required or unique, and it shouldn't be used in relationships if it is not, in fact, some kind of identifier. Why would a user care about collisions between employeeNum and userid if you're using the right attribute for the relationship in the first place?
EDIT as requested by OP
So what is wrong with saying UPDATE table SET EmployeeNum = 1000000 + UserID WHERE EmployeeNum IS NULL? If the existing EmployeeNum values stay below 1000000, then you've guaranteed no collisions and you've avoided hashing altogether.
You could generate similar padding if employeeNum might contain a string, but again is it EF that promotes these horrible column names? Why would a column with a Num suffix contain anything but a number?
You could also use a uniqueidentifier, setting the default value to (newid()).
Create a new column EmployeeNum as a uniqueidentifier, then:
UPDATE Employees SET EmployeeNum = newid()
Then set it as the primary key.
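A minimal sketch consolidating those steps (the Employees table name comes from the snippet above; the constraint names are assumptions, and this presumes the table doesn't already have a primary key):

ALTER TABLE Employees
    ADD EmployeeNum uniqueidentifier NOT NULL
        CONSTRAINT DF_Employees_EmployeeNum DEFAULT (NEWID());
-- adding the column as NOT NULL with a default fills existing rows automatically

ALTER TABLE Employees
    ADD CONSTRAINT PK_Employees PRIMARY KEY (EmployeeNum);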
UPDATE EMPLOYEE
SET EMPLOYEENUM = HASHBYTES('SHA1', CAST(USERID AS VARCHAR(20)))
WHERE EMPLOYEENUM IS NULL
