Removing Duplicates with SQL Express 2017 - sql-server

I have a table of 120 million rows. About 8 million of those rows are duplicates, depending on which value/column I use to determine duplicates. For argument's sake, I'm testing the email column vs. multiple columns to see what happens with my data.
The file is about 10 GB, so I cannot simply add another table to the database because of the size limits of SQL Express. Instead, I thought I'd extract, truncate, and insert using a temp table, since I've been meaning to try that method out.
I know I can use a CTE to remove the duplicates, but every single time I try that it takes forever and my system locks up. My solution is to do the following.
1. Extract all rows to tempdb
2. Sort by MIN(id)
3. Truncate the original table
4. Transfer the unique data from tempdb back to the main table
5. Take the extra duplicates and trim to uniques using Delimit
6. Import the leftover rows back into the database.
My table looks like the following.
Name   Gender   Age   Email             ID
Jolly  Female   28    jolly@jolly.com   1
Jolly  Female   28    jolly@jolly.com   2
Jolly  Female   28    jolly@jolly.com   3
Kate   Female   36    kate@kate.com     4
Kate   Female   36    kate@kate.com     5
Kate   Female   36    kate@kate.com     6
Jack   Male     46    jack@jack.com     7
Jack   Male     46    jack@jack.com     8
Jack   Male     46    jack@jack.com     9
My code
SET IDENTITY_INSERT test.dbo.contacts ON
GO
select name, gender, age, email, id into ##contacts
from test.dbo.contacts
WHERE id IN
(SELECT MIN(id) FROM test.dbo.contacts GROUP BY name)
TRUNCATE TABLE test.dbo.contacts
INSERT INTO test.dbo.contacts
SELECT name, gender, age, total_score, id
from ##students
SET IDENTITY_INSERT test.dbo.contactsOFF
GO
This code is almost working, except for the following error that I see.
"An explicit value for the identity column in table 'test.dbo.contacts' can only be specified when a column list is used and IDENTITY_INSERT is ON."
I have absolutely no idea why I keep seeing that message since I turned identity_insert on and off.
Can somebody please tell me what I'm missing in the code? And if anybody has another solution to keep unique rows I'd love to hear about it.

You said that your original problem was that " it takes forever and my system locks up".
The problem is the amount of time necessary for the operation and the lock escalation to table lock.
My suggestion is to break the operation down so that you delete fewer than 5000 rows at a time.
I assume you have less than 5000 duplicates for each name.
You can read more about lock escalation here:
https://www.sqlpassion.at/archive/2014/02/25/lock-escalations/
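The batched-delete idea can be sketched end to end. Here is a minimal runnable sketch using Python's sqlite3 (table and column names are made up for illustration; in SQL Server you would loop a DELETE TOP (4999) instead of the LIMIT subquery):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
# Nine rows, three names, so each name has two duplicates to remove.
rows = [(i, "name%d" % (i % 3), "n%d@example.com" % (i % 3)) for i in range(1, 10)]
conn.executemany("INSERT INTO contacts VALUES (?, ?, ?)", rows)

BATCH = 2  # in SQL Server, keep this under the ~5000-row lock-escalation threshold
while True:
    cur = conn.execute(
        """DELETE FROM contacts WHERE id IN (
               SELECT c.id FROM contacts c
               WHERE c.id <> (SELECT MIN(id) FROM contacts WHERE name = c.name)
               LIMIT ?)""",
        (BATCH,),
    )
    conn.commit()
    if cur.rowcount == 0:  # nothing left to delete: all duplicates are gone
        break

print(conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0])  # 3
```

Committing after each small batch is what keeps any single transaction's lock count low.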
About your problem (identity insert): the script you posted contains at least two errors (it selects into ##contacts but inserts from ##students, and it inserts total_score where the table has email), so I guess it's not your original script, and it's hard to say why the original one fails.
use test;
if object_ID('dbo.contacts') is not null drop table dbo.contacts;
CREATE TABLE dbo.contacts
(
id int identity(1,1) primary key clustered,
name nvarchar(50),
gender varchar(15),
age tinyint,
email nvarchar(50),
TS Timestamp
)
INSERT INTO [dbo].[contacts]([name],[gender],[age],[email])
VALUES
('Jolly','Female',28,'jolly@jolly.com'),
('Jolly','Female',28,'jolly@jolly.com'),
('Jolly','Female',28,'jolly@jolly.com'),
('Kate','Female',36,'kate@kate.com'),
('Kate','Female',36,'kate@kate.com'),
('Kate','Female',36,'kate@kate.com'),
('Jack','Male',46,'jack@jack.com'),
('Jack','Male',46,'jack@jack.com'),
('Jack','Male',46,'jack@jack.com');
-- For the purpose of lock escalation, I assume you have fewer than 5,000 duplicates for each single name.
if object_ID('tempdb..#KillList') is not null drop table #KillList;
SELECT KL.*, C.TS
into #KillList
from
(
SELECT [name], min(ID) GoodID
from dbo.contacts
group by name
having count(*) > 1
) KL inner join
dbo.contacts C
ON KL.GoodID = C.id
--This has the purpose of testing concurrent updates on relevant rows
--UPDATE [dbo].[contacts] SET Age = 47 where ID=7;
--DELETE [dbo].[contacts] where ID=7;
while EXISTS (SELECT top 1 1 from #KillList)
BEGIN
DECLARE @id int;
DECLARE @name nvarchar(50);
DECLARE @TS binary(8);
SELECT top 1 @id=GoodID, @name=Name, @TS=TS from #KillList;
BEGIN TRAN
if exists (SELECT * from [dbo].[contacts] where id=@id and TS=@TS)
BEGIN
DELETE FROM C
from [dbo].[contacts] C
where id <> @id and Name = @name;
DELETE FROM #KillList where Name = @name;
END
ELSE
BEGIN
ROLLBACK TRAN;
RAISERROR('Concurrency error while deleting %s', 16, 1, @name);
RETURN;
END
commit TRAN;
END
SELECT * from [dbo].[contacts];

I wrote it this way so that you can see the intermediate results of each query.
The inner SELECT should not use *; select only id.
delete from [contacts] where id in
(
select id from
(
select *, ROW_NUMBER() over (partition by name, gender, age, email order by id) as rowid from [contacts]
) rowstobedeleted where rowid>1
)
If this takes too long or causes too much load, you can use SET ROWCOUNT to delete in smaller chunks, but then you need to rerun it until nothing is deleted anymore.
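As a runnable illustration of the ROW_NUMBER() pattern above, here is the same delete expressed against Python's sqlite3, so it can be tried anywhere (sample data borrowed from the question; requires SQLite 3.25+ for window functions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, gender TEXT, age INT, email TEXT);
INSERT INTO contacts VALUES
 (1,'Jolly','Female',28,'jolly@jolly.com'),
 (2,'Jolly','Female',28,'jolly@jolly.com'),
 (3,'Kate','Female',36,'kate@kate.com');
""")
# Number the rows inside each duplicate group, then delete everything past row 1.
conn.execute("""
DELETE FROM contacts WHERE id IN (
    SELECT id FROM (
        SELECT id, ROW_NUMBER() OVER (PARTITION BY name, gender, age, email
                                      ORDER BY id) AS rn
        FROM contacts) t
    WHERE rn > 1)""")
print([r[0] for r in conn.execute("SELECT id FROM contacts ORDER BY id")])  # [1, 3]
```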

I think that you need something like this: when IDENTITY_INSERT is ON, the INSERT statement must use an explicit column list, which is exactly what the error message says.
INSERT INTO test.dbo.contacts (name, gender, age, email, id)
VALUES (value1, value2, value3, value4, value5)

Related

Snowflake - how to do multiple DML operations on same primary key in a specific order?

I am trying to set up continuous data replication into Snowflake. I receive the transactions that happened in the source system and I need to apply them in Snowflake in the same order as in the source system. I am trying to use MERGE for this, but when there are multiple operations on the same key in the source system, MERGE does not work correctly: it either misses an operation or returns a "duplicate row detected during DML operation" error.
Please note that the transactions need to happen in the exact same order; it is not possible to take only the latest transaction for a key (for example, if a record has been INSERTED and then UPDATED, it needs to be inserted first and then updated in Snowflake too, even though the insert is only a transient state).
Here is the example:
create or replace table employee_source (
id int,
first_name varchar(255),
last_name varchar(255),
operation_name varchar(255),
binlogkey integer
)
create or replace table employee_destination ( id int, first_name varchar(255), last_name varchar(255) );
insert into employee_source values (1,'Wayne','Bells','INSERT',11);
insert into employee_source values (1,'Wayne','BellsT','UPDATE',12);
insert into employee_source values (2,'Anthony','Allen','INSERT',13);
insert into employee_source values (3,'Eric','Henderson','INSERT',14);
insert into employee_source values (4,'Jimmy','Smith','INSERT',15);
insert into employee_source values (1,'Wayne','Bellsa','UPDATE',16);
insert into employee_source values (1,'Wayner','Bellsat','UPDATE',17);
insert into employee_source values (2,'Anthony','Allen','DELETE',18);
MERGE into employee_destination as T using (select * from employee_source order by binlogkey)
AS S
ON T.id = s.id
when not matched
And S.operation_name = 'INSERT' THEN
INSERT (id,
first_name,
last_name)
VALUES (
S.id,
S.first_name,
S.last_name)
when matched AND S.operation_name = 'UPDATE'
THEN
update set T.first_name = S.first_name, T.last_name = S.last_name
When matched
And S.operation_name = 'DELETE' THEN DELETE;
I am expecting to see 'Bellsat' as the last name for employee id 1 in the employee_destination table after all rows are processed. Similarly, I should not see emp id 2 in the employee_destination table.
Is there any other alternative to MERGE to achieve this? Basically to go over every single DML in the same order (using binlogkey column for ordering) .
thanks.
You need to manipulate your source data to ensure that you only have one record per key/operation, otherwise the join will be non-deterministic and will (depending on your settings) either error or update using a random one of the applicable source records. This is covered in the documentation here: https://docs.snowflake.com/en/sql-reference/sql/merge.html#duplicate-join-behavior.
In any case, why would you want to update a record only for it to be overwritten by another update? That would be incredibly inefficient.
Since your updates appear to include the new values for all rows, you can use a window function to get to just the latest incoming change, and then merge those results into the target table. For example, the select for that merge (with the window function to get only the latest change) would look like this:
with SOURCE_DATA as
(
select COLUMN1::int ID
,COLUMN2::string FIRST_NAME
,COLUMN3::string LAST_NAME
,COLUMN4::string OPERATION_NAME
,COLUMN5::int PROCESSING_ORDER
from values
(1,'Wayne','Bells','INSERT',11),
(1,'Wayne','BellsT','UPDATE',12),
(2,'Anthony','Allen','INSERT',13),
(3,'Eric','Henderson','INSERT',14),
(4,'Jimmy','Smith','INSERT',15),
(1,'Wayne','Bellsa','UPDATE',16),
(1,'Wayne','Bellsat','UPDATE',17),
(2,'Anthony','Allen','DELETE',18)
)
select * from SOURCE_DATA
qualify row_number() over (partition by ID order by PROCESSING_ORDER desc) = 1
That will produce a result set that has only the changes required to merge into the target table:
ID  FIRST_NAME  LAST_NAME  OPERATION_NAME  PROCESSING_ORDER
1   Wayne       Bellsat    UPDATE          17
2   Anthony     Allen      DELETE          18
3   Eric        Henderson  INSERT          14
4   Jimmy       Smith      INSERT          15
You can then change the when not matched clause to drop the operation_name check. If a key's latest operation is an UPDATE but the row is not yet in the target table, that's because it was inserted earlier within this same batch of changes, so it should still be inserted.
For the when matched clause, you can use the operation_name to determine if the row should be updated or deleted.
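The "latest change per key" step can be tried out locally. This sqlite3 sketch rewrites Snowflake's QUALIFY as a plain subquery (the data is the question's sample; the column names `op` and `ord` are shortened for the sketch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE src (id INT, first_name TEXT, last_name TEXT, op TEXT, ord INT);
INSERT INTO src VALUES
 (1,'Wayne','Bells','INSERT',11), (1,'Wayne','BellsT','UPDATE',12),
 (2,'Anthony','Allen','INSERT',13), (3,'Eric','Henderson','INSERT',14),
 (4,'Jimmy','Smith','INSERT',15), (1,'Wayne','Bellsa','UPDATE',16),
 (1,'Wayne','Bellsat','UPDATE',17), (2,'Anthony','Allen','DELETE',18);
""")
# Keep only the newest operation per id, which is what the QUALIFY clause does.
latest = conn.execute("""
SELECT id, last_name, op FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ord DESC) AS rn
    FROM src)
WHERE rn = 1 ORDER BY id""").fetchall()
print(latest)
```

The result carries one row per key, so the subsequent MERGE join is deterministic.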

Creating a trigger that pulls the price from one table and prevents entry into another

So, I'm kinda an SQL beginner who got himself into a conundrum.
I have a database that's set up with the concept of a luxury goods store. The table Products includes the field Price which can't be less than $500 (through a trigger). The table ProductReceipt has ID, ProductID, ReceiptID, Amount, Installments
I want to create a trigger that pulls the price from Products and if the price is higher than $600, you can input True on Installments. If it's lower, then it forces a False or rollback into a False. This should also take into account the field Amount, so if something costs $500, but there's two of said item, the Installments option should be applicable.
I'm not sure how exactly to go about it? A Join, perhaps?
The Product table contains for example:
ID: 1,
Name: Rouge Coco
Description: Random Description
Price: $500
CompanyID: 1 (in the Company table, 1 is Chanel)
Since you said you are designing the data yourself, you can handle this at database-design time instead.
I have tested the following in SQL Server.
CREATE TABLE Products
(Price int
, Installments AS CASE WHEN Price > 600 THEN 'TRUE'
WHEN Price < 600 THEN 'FALSE'
ELSE NULL END
)
Try this
create table Products123
(
ID int,
Name varchar(10),
Description varchar(100),
Price money,
CompanyID int
)
create table ProductReceipt123
(
ID int,
ProductID int,
ReceiptID int,
Amount money,
Installments bit
)
create TRIGGER tInsert
ON Products123
AFTER INSERT,DELETE
AS
BEGIN
-- SET NOCOUNT ON added to prevent extra result sets from
-- interfering with SELECT statements.
SET NOCOUNT ON;
Declare @insertedPrice money, @insertedID int
select @insertedPrice=Price, @insertedID=ID from inserted
if(@insertedPrice<500)
Begin
delete from Products123 where ID=@insertedID
select 'Price should be greater than 500'
End
else if(@insertedPrice>600)
Begin
insert into ProductReceipt123(ProductID,Amount,Installments) values(@insertedID,@insertedPrice,1)
End
else if(@insertedPrice<600)
Begin
insert into ProductReceipt123(ProductID,Amount,Installments) values(@insertedID,@insertedPrice,0)
End
End
-- Insert statements for trigger here
END
GO
Whilst this may not 100% answer the question as described (it does not use triggers), it will do the same job more efficiently (although I stand to be corrected on that) and with less code.
Your reason for using a trigger seems odd; it sounds like you have to use triggers for the sake of using triggers.
Any way, here is the constraint:
ALTER TABLE ProductReceipt ADD CONSTRAINT CK_Installments CHECK ((Amount > 600 AND Installments IN (1,0)) OR (Amount <= 600 AND Installments = 0))
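The constraint's logic can be checked quickly with a sqlite3 sketch (same predicate as the T-SQL above; the table is trimmed to the relevant columns):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE ProductReceipt (
    ID INTEGER PRIMARY KEY,
    Amount NUMERIC,
    Installments INTEGER,
    CHECK ((Amount > 600 AND Installments IN (1, 0))
        OR (Amount <= 600 AND Installments = 0)))""")

# Above the threshold: installments may be 1.
conn.execute("INSERT INTO ProductReceipt (Amount, Installments) VALUES (700, 1)")
# At or below the threshold: installments must be 0, so this insert is rejected.
try:
    conn.execute("INSERT INTO ProductReceipt (Amount, Installments) VALUES (500, 1)")
except sqlite3.IntegrityError:
    print("rejected")  # prints "rejected"
```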

SQL scheduled jobs Failed Notification

I need an idea or a way to check whether my scheduled jobs ran and inserted the right data. I have a job that calls a stored procedure to count the data inserted today and insert the counts into specific fields,
like this:
SET @male = (select count(idUser) from (select idUser from #tmpLog2 where sex = 1 AND CAST(catchTime as date) = @DATE group by idUser)u);
SET @female = (select count(idUser) from (select idUser from #tmpLog2 where sex = 0 AND CAST(catchTime as date) = @DATE group by idUser)u);
INSERT INTO CatchLog
(
[male],
[female]
)
VALUES
(
@male,
@female
)
The stored procedure works OK, but sometimes, when the day has a lot of data, it inserts 0 for male/female.
It's possible for 0 to be correct when the day really has no male/female data,
but sometimes it inserts 0 even though there is male and female data. Can anyone help me check whether the data was inserted correctly, and produce a report or write to an error log when it was not?
Sorry for my bad English.
I don't know which column you are using in the COUNT() function, but if you are using a nullable column, it may give 0 even if records were added, because NULL values are not counted.
Consider using the primary key column or another non-nullable column as the COUNT() argument.
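The NULL pitfall described above is easy to demonstrate. This sqlite3 sketch uses made-up data shaped like the question's #tmpLog2 table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmpLog2 (idUser INT, sex INT);
INSERT INTO tmpLog2 VALUES (NULL, 1), (NULL, 1), (7, 1);
""")
# COUNT(column) skips NULLs, while COUNT(*) counts every row.
count_col, count_star = conn.execute(
    "SELECT COUNT(idUser), COUNT(*) FROM tmpLog2 WHERE sex = 1").fetchone()
print(count_col, count_star)  # 1 3
```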

Referential integrity issue with Untyped XML in TSQL

I am going to start off by displaying my table structures:
Numbers Table:
Id AccountId MobileNr FirstName LastName AttributeKeyValues Labels
--- ---------- ----------- ---------- ----------- ------------------- -------
490 2000046 2XXXXXXXXXX Eon du Plessis <attrs /> <lbls>
<lbl>Meep11e</lbl>
<lbl>43210</lbl>
<lbl>1234</lbl>
<lbl>Label 5</lbl>
<lbl>Label 6 (edit)</lbl>
</lbls>
-----------------------------------------------------------------------------
Labels Table:
Id AccountId Label RGB LastAssigned LastMessage
----------- ----------- ----------------- ------ ----------------------- ------------
91 2000046 Meep11e 000000 2013-04-15 13:42:06.660 NULL
-------------------------------------------------------------------------------------
This is the issue
Every number can have multiple labels assigned to it and is stored as untyped XML. In Numbers.Labels //lbls/lbl/text() you will notice that the text there will match the text in Labels.Label
This is the stored procedure which updates the Numbers.Labels column, and is run by an external application I am busy writing. The XML structure is generated by this external application, depending on which rows are read in the Labels.Label table
CREATE PROCEDURE [dbo].[UpdateLabels]
@Id INT,
@Labels XML
AS
BEGIN
UPDATE
Numbers
SET
Labels = @Labels
WHERE
Id = @Id
UPDATE
Labels
SET
LastAssigned = GETDATE()
WHERE
label
IN
(SELECT @Labels.value('(//lbls/lbl)[1]', 'VARCHAR(100)'))
END
The issue here is: if two people log onto the same account, each with their own session, and User 1 tries to run this update stored procedure, but just before the button is pressed User 2 deletes one of the labels in the Labels table that was included in User 1's update, then the XML will include the "deleted" row. That is problematic when I query the numbers again (the RGB column gets queried when I display the number, since the label is marked up in jQuery with a hexadecimal background colour).
My thought was to check that the rows included in the built-up XML exist before committing the update. How can I achieve this in T-SQL? Or can anyone recommend a better way?
EDIT
Our table structure is intentionally denormalized, there are no foreign key constraints.
EDIT 2
Ok, it would seem my question is a bit hard, or that I brained too hard and got the dumb :). I will try and simplify.
In the Labels column in Numbers, every <lbl> element must exist within the Labels table
When updating the Labels column in Numbers, if a Label in the XML is found which does not exist in the Labels table, an error must be raised.
The XML is pre-formed in my application, meaning, every time the update is run, the old XML in the Labels column in Numbers will be REPLACED with the new XML generated by my application
This is where I need to check whether there are label nodes in my XML which no longer exists within the Labels table
I would check to see if there are rows in your xml that are not in the real table (in the database) before trying anything. And if you find something, exit out early.
Here is a Northwind example.
Use Northwind
GO
DECLARE @data XML;
SET @data =
N'
<root>
<Order>
<OrderId>10248</OrderId>
<CustomerId>VINET</CustomerId>
</Order>
<Order>
<OrderId>-9999</OrderId>
<CustomerId>CHOPS</CustomerId>
</Order>
</root>';
/* select * from dbo.Orders */
declare @Holder table ( OrderId int, CustomerId nchar(5) )
Insert Into @Holder (OrderId , CustomerId )
SELECT
T.myAlias.value('(./OrderId)[1]', 'int') AS OrderId
, T.myAlias.value('(./CustomerId)[1]', 'nchar(5)') AS CustomerId
FROM
@data.nodes('//root/Order') AS T(myAlias);
if exists (select null from @Holder h where not exists (select null from dbo.Orders realTable where realTable.OrderID = h.OrderId ))
BEGIN
print 'you have rows in your xml that are not in the real table. raise an error here'
END
Else
BEGIN
print 'Using the data'
Update dbo.Orders Set CustomerID = h.CustomerId
From dbo.Orders o , @Holder h
Where o.OrderID = h.OrderId
END

SQL Server Stored Procedure to dump oldest X records when new records added

I have a licensing scenario where when a person activates a new system it adds the old activations to a lockout table so they can only have their latest X systems activated. I need to pass a parameter of how many recent activations to keep and all older activations should be added to the lockout table if they are not already locked out. I'm not sure how best to do this, i.e. a temp table (which I've never done) etc.
For example, an activation comes in from John Doe on System XYZ. I would then need to query the activations table for all activations by John Doe and sort it by DATE DESC. John Doe may have a license allowing two systems in this case so I need all records older than the top 2 deactivated, i.e. inserted into a lockouts table.
Thanks in advance for your assistance.
Something like this perhaps?
insert into lockouts
(<column list>)
select <column list>
from (select <column list>,
row_number() over (order by date desc) as RowNum
from activations) t
where t.RowNum > @NumLicenses
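Here is a runnable sketch of that pattern using Python's sqlite3 (the table names, the `owner` column, and the two-license limit are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activations (id INTEGER PRIMARY KEY, owner TEXT, activated_at TEXT);
CREATE TABLE lockouts (id INT);
INSERT INTO activations (owner, activated_at) VALUES
 ('john','2021-01-01'), ('john','2021-02-01'), ('john','2021-03-01');
""")
num_licenses = 2
# Rank activations newest-first per owner; everything past the limit is locked out.
conn.execute("""
INSERT INTO lockouts (id)
SELECT id FROM (
    SELECT id, ROW_NUMBER() OVER (PARTITION BY owner ORDER BY activated_at DESC) AS rn
    FROM activations) t
WHERE rn > ?""", (num_licenses,))
print([r[0] for r in conn.execute("SELECT id FROM lockouts")])  # [1]
```

Only the oldest activation (id 1) falls outside the two most recent, so it is the one inserted into the lockout table.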
It'd probably be easiest to couple ROW_NUMBER() OVER with a view or table-valued function:
WITH ActivationRank AS
(
SELECT SystemId,ProductId,CreatedDate,ROW_NUMBER() OVER(PARTITION BY ProductId ORDER BY CreatedDate DESC) AS RANK
FROM [Activations]
)
SELECT SystemId, ProductId, CASE WHEN RANK < @lockoutParameterOrConstant THEN 0 ELSE 1 END AS LockedOut
FROM ActivationRank
Before you invest time reading and trying my approach, I want to say that Joe Stefanelli's answer is an excellent one: short, compact, advanced and probably better than mine, especially in terms of performance. On the other hand, performance might not be your first concern (how many activations do you expect per day? per hour? per minute?) and my example may be easier to read and understand.
As I don't know how your database schema is set up, I had do to some assumptions on it. You probably won't be able to use this code as a copy and paste template, but it should give you an idea on how to do it.
You were talking about a lockout table, so I reckon you have a reason to duplicate portions of the data into a second table. If possible, I would rather use a lockout flag in the table containing the systems data, but obviously that depends on your scenario.
Please be aware that I currently do not have access to a SQL Server, so I could not check the validity of the code. I tried my best, but there may be typos in it even though.
First assumption: A minimalistic "registered systems" table:
CREATE TABLE registered_systems
(id INT NOT NULL IDENTITY,
owner_id INT NOT NULL,
system_id VARCHAR(MAX) NOT NULL,
activation_date DATETIME NOT NULL)
Second assumption: A minimalistic "locked out systems" table:
CREATE TABLE locked_out_systems
(id INT NOT NULL,
lockout_date DATETIME NOT NULL)
Then we can define a stored procedure to activate a new system. It takes the owner_id, the number of allowed systems and of course the new system id as parameters.
CREATE PROCEDURE register_new_system
@owner_id INT,
@allowed_systems_count INT,
@new_system_id VARCHAR(MAX)
AS
BEGIN TRANSACTION
-- Variable declaration
DECLARE @sid INT -- Storage for a system id
-- Insert the new system
INSERT INTO registered_systems
(owner_id, system_id, activation_date)
VALUES
(@owner_id, @new_system_id, GETDATE())
-- Use a cursor to query all registered-and-not-locked-out systems for this
-- owner. Skip the first #allowed_systems_count systems, then insert the
-- remaining ones into the lockout table.
DECLARE c_systems CURSOR FAST_FORWARD FOR
SELECT r.id FROM
registered_systems r
LEFT OUTER JOIN
locked_out_systems l
ON r.id = l.id
WHERE l.id IS NULL
AND r.owner_id = @owner_id
ORDER BY r.activation_date DESC
OPEN c_systems
FETCH NEXT FROM c_systems INTO @sid
WHILE @@FETCH_STATUS = 0
BEGIN
IF @allowed_systems_count > 0
-- System still allowed, just decrement the counter
SET @allowed_systems_count = @allowed_systems_count - 1
ELSE
-- All allowed systems used up, insert this one into lockout table
INSERT INTO locked_out_systems
(id, lockout_date)
VALUES
(@sid, GETDATE())
FETCH NEXT FROM c_systems INTO @sid
END
CLOSE c_systems
DEALLOCATE c_systems
COMMIT
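For comparison, the cursor loop above can be collapsed into a single set-based statement. Here is a sketch of that alternative using Python's sqlite3, with the schema simplified from the answer's tables and a window function standing in for the cursor:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE registered_systems (id INTEGER PRIMARY KEY, owner_id INT,
                                 system_id TEXT, activation_date TEXT);
CREATE TABLE locked_out_systems (id INT, lockout_date TEXT);
INSERT INTO registered_systems (owner_id, system_id, activation_date) VALUES
 (1,'sys-a','2021-01-01'), (1,'sys-b','2021-02-01'), (1,'sys-c','2021-03-01');
""")
allowed = 2
# Rank this owner's not-yet-locked-out systems newest-first, lock out the rest.
conn.execute("""
INSERT INTO locked_out_systems (id, lockout_date)
SELECT id, DATE('now') FROM (
    SELECT r.id, ROW_NUMBER() OVER (ORDER BY r.activation_date DESC) AS rn
    FROM registered_systems r
    LEFT JOIN locked_out_systems l ON l.id = r.id
    WHERE l.id IS NULL AND r.owner_id = ?) t
WHERE rn > ?""", (1, allowed))
print([r[0] for r in conn.execute("SELECT id FROM locked_out_systems")])  # [1]
```

The set-based form does the whole lockout in one statement instead of one round trip per row, which is usually both simpler and faster than the cursor.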
