Delete duplicates from large dataset (>100Mio rows)

Delete duplicates from large dataset (>100Mio rows) - sql-server

I know that this topic came up many times before here but none of the suggested solutions worked for my dataset because my laptop stopped calculating due to memory issues or full storage.
My table looks like the following and has 108 Mio rows:
Col1 |Col2 | Col3 |Col4 |SICComb | NameComb
Case New |3523 | Alexander |6799 |67993523| AlexanderCase New
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
Undisclosed|6799 | Case New |3523 |67993523| Case NewUndisclosed
Case New |3523 | Undisclosed |6799 |67993523| Case NewUndisclosed
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
SmartCard |3674 | Virtual NetComm|7373 |73733674| SmartCardVirtual NetComm
SmartCard |3674 | NEC |7373 |73733674| NECSmartCard
The unique columns are SICComb and NameComb. I tried to add a primary key with:
ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1)
but the integers are filling up more than 30 GB of my storage just in a new minutes.
Which would be the fastest and most efficient method to delete the duplicates from the table?

If you're using SQL Server, you can use delete from common table expression:
with cte as (
select row_number() over(partition by SICComb, NameComb order by Col1) as row_num
from Table1
)
delete
from cte
where row_num > 1
Here all rows will be numbered, you get own sequence for each unique combination of SICComb + NameComb. You can choose which rows you want to delete by choosing order by inside the over clause.

In general, the fastest way to delete duplicates from a table is to insert the records -- without duplicates -- into a temporary table, truncate the original table and insert them back in.
Here is the idea, using SQL Server syntax:
select distinct t.*
into #temptable
from t;
truncate table t;
insert into t
select tt.*
from #temptable;
Of course, this depends to a large extent on how fast the first step is. And, you need to have the space to store two copies of the same table.
Note that the syntax for creating the temporary table differs among databases. Some use the syntax of create table as rather than select into.
EDIT:
Your identity insert error is troublesome. I think you need to remove the identity from the list of columns for the distinct. Or do:
select min(<identity col>), <all other columns>
from t
group by <all other columns>
If you have an identity column, then there are no duplicates (by definition).
In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, then just leave the identity column out of the column list for the insert:
insert into t(<all other columns>)
select <all other columns>;
If you need the old identity value (and the minimum will do), turn off identity insert and do:
insert into t(<all columns including identity>)
select <all columns including identity>;

Related

How to efficiently replace long strings by their index for SQL Server inserts?

I have a very large DataTable-Object which I need to import from a client into an MS SQL-Server database via ODBC.
The original Data-Table has two columns:
* First column is the Office Location (quite a long string)
* Second column is a booking value (integer)
Now I am looking for the most efficient way to insert these data into an external SQL-Server. My goal is to replace each office location automatically by an index instead using the full string because each location occurs VERY often in the initial table.
Is this possible via a trigger or via a view on the SQL-server?
At the end I want to insert the data without touching them in my script because this is very slow for these large amount of data and let the optimization done by the SQL Server.
I expect that if I do INSERT the data including the Office location, that SQL Server looks up an index for an already imported location and then use just this index. And if the location did not already exist in the index table / view then it should create a new entry here and then use the new index.
Here a sample of the data I need to import via ODBC into the SQL-Server:
OfficeLocation | BookingValue
EU-Germany-Hamburg-Ostend1 | 12
EU-Germany-Hamburg-Ostend1 | 23
EU-Germany-Hamburg-Ostend1 | 34
EU-France-Paris-Eifeltower | 42
EU-France-Paris-Eifeltower | 53
EU-France-Paris-Eifeltower | 12
What I do need on the SQL-Server is something like these 2 tables as a result:
OId|BookingValue OfficeLocation |Oid
1|12 EU-Germany-Hamburg-Ostend1 | 1
1|23 EU-France-Paris-Eifeltower | 2
1|43
2|42
2|53
2|12
My initial idea was, to write the data into a temp-table and have something like an intelligent TRIGGER (or a VIEW?) to react on any INSERT into this table to create the 2 desired (optimized) tables.
Any hint are more than welcome!

Yes, you can create a view with an INSERT trigger to handle this. Something like:
CREATE TABLE dbo.Locations (
OId int IDENTITY(1,1) not null PRIMARY KEY,
OfficeLocation varchar(500) not null UNIQUE
)
GO
CREATE TABLE dbo.Bookings (
OId int not null,
BookingValue int not null
)
GO
CREATE VIEW dbo.CombinedBookings
WITH SCHEMABINDING
AS
SELECT
OfficeLocation,
BookingValue
FROM
dbo.Bookings b
INNER JOIN
dbo.Locations l
ON
b.OId = l.OId
GO
CREATE TRIGGER CombinedBookings_Insert
ON dbo.CombinedBookings
INSTEAD OF INSERT
AS
INSERT INTO Locations (OfficeLocation)
SELECT OfficeLocation
FROM inserted where OfficeLocation not in (select OfficeLocation from Locations)
INSERT INTO Bookings (OId,BookingValue)
SELECT OId, BookingValue
FROM
inserted i
INNER JOIN
Locations l
ON
i.OfficeLocation = l.OfficeLocation
As you can see, we first add to the locations table any missing locations and then populate the bookings table.
A similar trigger can cope with Updates. I'd generally let the Locations table just grow and not attempt to clean it up (for no longer referenced locations) with triggers. If growth is a concern, a periodic job will usually be good enough.
Be aware that some tools (such as bulk inserts) may not invoke triggers, so those will not be usable with the above view.

SQL unique PK for grouped data in SP

I am trying to build a temp table with grouped data from multiple tables (in an SP), I am successful in building the data set however I have a requirement that each grouped row have a unique id. I know there are ways to generate unique ids for each row, However the problem I have is that I need the id for a given row to be the same on each run regardless of the number of rows returned.
Example:
1st run:
ID Column A Column B
1 apple 15
2 orange 10
3 grape 11
2nd run:
ID Column A Column B
3 grape 11
The reason I want this is because i am sending this data up to SOLR and when I do a delta I need to have the ID back for the same row as its trying to re-index
Any way I can do this?

Not sure if this will help, not entirely confident of your wider picture, but ...
As your new data is assembled, log each [column a] value in a table of your own.
Give that table an IDENTITY column to do the numbering for you.
Now you can join any new data sets to your lookup table and you'll have a persistent number for each column A.
You just need to ensure that each time you query new data, you add new values to the lookup table.
create table dbo.myRef(
idx int identity(1,1)
,[A] nvarchar(100)
)
General draft as below ...
--- just simulating some input data here
with cte as (
select 'apple' as [A], 15 as [B]
UNION
select 'orange' as [A], 10 as [B]
UNION
select 'banana' as [A], 4 as [B]
)
select * into #temp from cte;
-- Put any new values into the lookup table
-- and they will be assigned a new index number by the identity column
insert into dbo.myRef([A])
select distinct [A]
from #temp where [A] not in (select [A] from dbo.myRef)
-- now pull your original data for output, joining to the lookup table to get a ref number.
select T.*,R.idx
from #temp T
inner join
oer.myRef R
on T.[A] = R.[A]

Sorry for the late reply, i was stuck with something else, however i solved my own issue.
I built 2 temp tables one with all the data from the various tables (#master) and another temp table (#final) to house all the grouped data with an empty column for ID
Next i did a concat(column1, '-',column2,'-', column3) on 3 columns from the #master and updated the #final table based on the type
this helped me to get the same concat ids on each run

How to shift entire row from last to 3rd position without changing values in SQL Server

This is my table:
DocumentTypeId DocumentType UserId CreatedDtm
--------------------------------------------------------------------------
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.4170000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.7000000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.4730000
40f8ff9f Word 79b399715 1994-04-23 10:33:44.2300000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.1700000
How can I shift the last entire row (DocumentType=email) on 3rd position,(before DocumentType=XL) without changing table values?

Without wishing to deny the truth of what others have said here, SQL Server does have CLUSTERED indices. For full details on these and the difference between a clustered table and a non-clustered one, please see here. In effect, a clustered table does have data written to disk in index order. However, due to subsequent insertions and deletions, you should never rely on any given record being in a fixed ordinal position.
To get your data showing email third and XL fourth, you simply need to order by CreatedDtm. Thus:
declare #test table
(
DocumentTypeID varchar(20),
DocumentType varchar(10),
UserID varchar(20),
CreatedDtm datetime
)
INSERT INTO #test VALUES
('2d47e2f8-4','PDF','443f-4baa','2015-12-03 17:56:59'),
('b4b-4803-a','Images','a99f-1fd','1997-02-11 22:16:51'),
('600-0e32','XL','e60e07a6b','2015-08-19 15:26:11'),
('40f8ff9f','Word','79b399715','1994-04-23 10:33:44'),
('8230a07c','email','750e-4c3d','2015-01-10 09:56:08')
SELECT * FROM #test order by CreatedDtm
This gives a result set of:
40f8ff9f Word 79b399715 1994-04-23 10:33:44.000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.000
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.000
This maybe what you are looking for, but I cannot stress enough, that it only gives email 3rd and XL 4th in this particular case. If the dates were different, it would not be so. But perhaps, this was all that you needed?

I assumed that you need to sort by DocumentTypecolumn.
Joining with a temp table, which may contain virtually DocumenTypes with desired SortOrder, you can achieve the result you want.
declare #tbl table(
DocumentTypeID varchar(50),
DocumentType varchar(50)
)
insert into #tbl(DocumentTypeID, DocumentType)
values
('2d47e2f8-4','PDF'),
('b4b-4803-a','Images'),
('600-0e32','XL'),
('40f8ff9f','Word'),
('8230a07c','email')
;
--this will give you original output
select * from #tbl;
--this will output rows with new sort order
select t.* from #tbl t
inner join
(
select *
from
(values
('PDF',1, 1),
('Images',2, 2),
('XL',3, 4),
('Word',4, 5),
('email',5, 3) --here I put new sort order '3'
) as dt(TypeName, SortOrder, NewSortOrder)
) dt
on dt.TypeName = t.DocumentType
order by dt.NewSortOrder

The row positions don't really matter in SQL tables, since it's all unordered sets of data, but if you really want to switch the rows I'd suggest you send all your data to temp table e.g,
SELECT * FROM [tablename] INTO #temptable
then delete/truncate the data from that table (if it won't mess the other tables it's connected to) and use the temp table you made to insert into it as you like, since it'll have all the same fields with the same data from the original.

Computed column expression

I have a specific need for a computed column called ProductCode
ProductId | SellerId | ProductCode
1 1 000001
2 1 000002
3 2 000001
4 1 000003
ProductId is identity, increments by 1.
SellerId is a foreign key.
So my computed column ProductCode must look how many products does Seller have and be in format 000000. The problem here is how to know which Sellers products to look for?
I've written have a TSQL which doesn't look how many products does a seller have
ALTER TABLE dbo.Product
ADD ProductCode AS RIGHT('000000' + CAST(ProductId AS VARCHAR(6)) , 6) PERSISTED

You cannot have a computed column based on data outside of the current row that is being updated. The best you can do to make this automatic is to create an after-trigger that queries the entire table to find the next value for the product code. But in order to make this work you'd have to use an exclusive table lock, which will utterly destroy concurrency, so it's not a good idea.
I also don't recommend using a view because it would have to calculate the ProductCode every time you read the table. This would be a huge performance-killer as well. By not saving the value in the database never to be touched again, your product codes would be subject to spurious changes (as in the case of perhaps deleting an erroneously-entered and never-used product).
Here's what I recommend instead. Create a new table:
dbo.SellerProductCode
SellerID LastProductCode
-------- ---------------
1 3
2 1
This table reliably records the last-used product code for each seller. On INSERT to your Product table, a trigger will update the LastProductCode in this table appropriately for all affected SellerIDs, and then update all the newly-inserted rows in the Product table with appropriate values. It might look something like the below.
See this trigger working in a Sql Fiddle
CREATE TRIGGER TR_Product_I ON dbo.Product FOR INSERT
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE #LastProductCode TABLE (
SellerID int NOT NULL PRIMARY KEY CLUSTERED,
LastProductCode int NOT NULL
);
WITH ItemCounts AS (
SELECT
I.SellerID,
ItemCount = Count(*)
FROM
Inserted I
GROUP BY
I.SellerID
)
MERGE dbo.SellerProductCode C
USING ItemCounts I
ON C.SellerID = I.SellerID
WHEN NOT MATCHED BY TARGET THEN
INSERT (SellerID, LastProductCode)
VALUES (I.SellerID, I.ItemCount)
WHEN MATCHED THEN
UPDATE SET C.LastProductCode = C.LastProductCode + I.ItemCount
OUTPUT
Inserted.SellerID,
Inserted.LastProductCode
INTO #LastProductCode;
WITH P AS (
SELECT
NewProductCode =
L.LastProductCode + 1
- Row_Number() OVER (PARTITION BY I.SellerID ORDER BY P.ProductID DESC),
P.*
FROM
Inserted I
INNER JOIN dbo.Product P
ON I.ProductID = P.ProductID
INNER JOIN #LastProductCode L
ON P.SellerID = L.SellerID
)
UPDATE P
SET P.ProductCode = Right('00000' + Convert(varchar(6), P.NewProductCode), 6);
Note that this trigger works even if multiple rows are inserted. There is no need to preload the SellerProductCode table, either--new sellers will automatically be added. This will handle concurrency with few problems. If concurrency problems are encountered, proper locking hints can be added without deleterious effect as the table will remain very small and ROWLOCK can be used (except for the INSERT which will require a range lock).
Please do see the Sql Fiddle for working, tested code demonstrating the technique. Now you have real product codes that have no reason to ever change and will be reliable.

I would normally recommend using a view to do this type of calculation. The view could even be indexed if select performance is the most important factor (I see you're using persisted).
You cannot have a subquery in a computed column, which essentially means that you can only access the data in the current row. The only ways to get this count would be to use a user-defined function in your computed column, or triggers to update a non-computed column.
A view might look like the following:
create view ProductCodes as
select p.ProductId, p.SellerId,
(
select right('000000' + cast(count(*) as varchar(6)), 6)
from Product
where SellerID = p.SellerID
and ProductID <= p.ProductID
) as ProductCode
from Product p
One big caveat to your product numbering scheme, and a downfall for both the view and UDF options, is that we're relying upon a count of rows with a lower ProductId. This means that if a Product is inserted in the middle of the sequence, it would actually change the ProductCodes of existing Products with a higher ProductId. At that point, you must either:
Guarantee the sequencing of ProductId (identity alone does not do this)
Rely upon a different column that has a guaranteed sequence (still dubious, but maybe CreateDate?)
Use a trigger to get a count at insert which is then never changed.

Preserving ORDER BY in SELECT INTO

I have a T-SQL query that takes data from one table and copies it into a new table but only rows meeting a certain condition:
SELECT VibeFGEvents.*
INTO VibeFGEventsAfterStudyStart
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id
The code using the table relies on its order, and the copy above does not preserve the order I expected. I.e. the rows in the new table VibeFGEventsAfterStudyStart are not monotonically increasing in the VibeFGEventsAfterStudyStart.id column copied from VibeFGEvents.id.
In T-SQL how might I preserve the ordering of the rows from VibeFGEvents in VibeFGEventsStudyStart?

I know this is a bit old, but I needed to do something similar. I wanted to insert the contents of one table into another, but in a random order. I found that I could do this by using select top n and order by newid(). Without the 'top n', order was not preserved and the second table had rows in the same order as the first. However, with 'top n', the order (random in my case) was preserved. I used a value of 'n' that was greater than the number of rows. So my query was along the lines of:
insert Table2 (T2Col1, T2Col2)
select top 10000 T1Col1, T1Col2
from Table1
order by newid()

What for?
Point is – data in a table is not ordered. In SQL Server the intrinsic storage order of a table is that of the (if defined) clustered index.
The order in which data is inserted is basically "irrelevant". It is forgotten the moment the data is written into the table.
As such, nothing is gained, even if you get this stuff. If you need an order when dealing with data, you HAVE To put an order by clause on the select that gets it. Anything else is random - i.e. the order you et data is not determined and may change.
So it makes no sense to have a specific order on the insert as you try to achieve.
SQL 101: sets have no order.

Just add top to your sql with a number that is greater than the actual number of rows:
SELECT top 25000 *
into spx_copy
from SPX
order by date

I've found a specific scenario where we want the new table to be created with a specific order in the columns' content:
Amount of rows is very big (from 200 to 2000 millions of rows), so we are using SELECT INTO instead of CREATE TABLE + INSERT because needs to be loaded as fast as possible (minimal logging). We have tested using the trace flag 610 for loading an already created empty table with a clustered index but still takes longer than the following approach.
We need the data to be ordered by specific columns for query performances, so we are creating a CLUSTERED INDEX just after the table is loaded. We discarded creating a non-clustered index because it would need another read for the data that's not included in the ordered columns from the index, and we discarded creating a full-covering non-clustered index because it would practically double the amount of space needed to hold the table.
It happens that if you manage to somehow create the table with columns already "ordered", creating the clustered index (with the same order) takes a lot less time than when the data isn't ordered. And sometimes (you will have to test your case), ordering the rows in the SELECT INTO is faster than loading without order and creating the clustered index later.
The problem is that SQL Server 2012+ will ignore the ORDER BY column list when doing INSERT INTO or when doing SELECT INTO. It will consider the ORDER BY columns if you specify an IDENTITY column on the SELECT INTO or if the inserted table has an IDENTITY column, but just to determine the identity values and not the actual storage order in the underlying table. In this case, it's likely that the sort will happen but not guaranteed as it's highly dependent on the execution plan.
A trick we have found is that doing a SELECT INTO with the result of a UNION ALL makes the engine perform a SORT (not always an explicit SORT operator, sometimes a MERGE JOIN CONCATENATION, etc.) if you have an ORDER BY list. This way the select into already creates the new table in the order we are going to create the clustered index later and thus the index takes less time to create.
So you can rewrite this query:
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
ORDER BY -- ORDER BY is ignored!
FirstColumn,
SecondColumn
to
SELECT
FirstColumn = T.FirstColumn,
SecondColumn = T.SecondColumn
INTO
#NewTable
FROM
VeryBigTable AS T
UNION ALL
-- A "fake" row to be deleted
SELECT
FirstColumn = 0,
SecondColumn = 0
ORDER BY
FirstColumn,
SecondColumn
We have used this trick a few times, but I can't guarantee it will always sort. I'm just posting this as a possible workaround in case someone has a similar scenario.

You cannot do this with ORDER BY but if you create a Clustered Index on VibeFGEvents.id after your SELECT INTO the table will be sorted on disk by VibeFGEvents.id.

I'v made a test on MS SQL 2012, and it clearly shows me, that insert into ... select ... order by makes sense. Here is what I did:
create table tmp1 (id int not null identity, name sysname);
create table tmp2 (id int not null identity, name sysname);
insert into tmp1 (name) values ('Apple');
insert into tmp1 (name) values ('Carrot');
insert into tmp1 (name) values ('Pineapple');
insert into tmp1 (name) values ('Orange');
insert into tmp1 (name) values ('Kiwi');
insert into tmp1 (name) values ('Ananas');
insert into tmp1 (name) values ('Banana');
insert into tmp1 (name) values ('Blackberry');
select * from tmp1 order by id;
And I got this list:
1 Apple
2 Carrot
3 Pineapple
4 Orange
5 Kiwi
6 Ananas
7 Banana
8 Blackberry
No surprises here. Then I made a copy from tmp1 to tmp2 this way:
insert into tmp2 (name)
select name
from tmp1
order by id;
select * from tmp2 order by id;
I got the exact response like before. Apple to Blackberry.
Now reverse the order to test it:
delete from tmp2;
insert into tmp2 (name)
select name
from tmp1
order by id desc;
select * from tmp2 order by id;
9 Blackberry
10 Banana
11 Ananas
12 Kiwi
13 Orange
14 Pineapple
15 Carrot
16 Apple
So the order in tmp2 is reversed too, so order by made sense when there is a identity column in the target table!

The reason why one would desire this (a specific order) is because you cannot define the order in a subquery, so, the idea is that, if you create a table variable, THEN make a query from that table variable, you would think you would retain the order(say, to concatenate rows that must be in order- say for XML or json), but you can't.
So, what do you do?
The answer is to force SQL to order it by using TOP in your select (just pick a number high enough to cover all your rows).

I have run into the same issue and one reason I have needed to preserve the order is when I try to use ROLLUP to get a weighted average based on the raw data and not an average of what is in that column. For instance, say I want to see the average of profit based on number of units sold by four store locations? I can do this very easily by creating the equation Profit / #Units = Avg. Now I include a ROLLUP in my GROUP BY so that I can also see the average across all locations. Now I think to myself, "This is good info but I want to see it in order of Best Average to Worse and keep the Overall at the bottom (or top) of the list)." The ROLLUP will fail you in this so you take a different approach.
Why not create row numbers based on the sequence (order) you need to preserve?
SELECT OrderBy = ROW_NUMBER() OVER(PARTITION BY 'field you want to count' ORDER BY 'field(s) you want to use ORDER BY')
, VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
Now you can use the OrderBy field from your table to set the order of values. I removed the ORDER BY statement from the query above since it does not affect how the data is loaded to the table.

I found this approach helpful to solve this problem:
WITH ordered as
(
SELECT TOP 1000
[Month]
FROM SourceTable
GROUP BY [Month]
ORDER BY [Month]
)
INSERT INTO DestinationTable (MonthStart)
(
SELECT * from ordered
)

Try using INSERT INTO instead of SELECT INTO
INSERT INTO VibeFGEventsAfterStudyStart
SELECT VibeFGEvents.*
FROM VibeFGEvents
LEFT OUTER JOIN VibeFGEventsStudyStart
ON
CHARINDEX(REPLACE(REPLACE(REPLACE(logName, 'MyVibe ', ''), ' new laptop', ''), ' old laptop', ''), excelFilename) > 0
AND VibeFGEventsStudyStart.MIN_TitleInstID <= VibeFGEvents.TitleInstID
AND VibeFGEventsStudyStart.MIN_WinInstId <= VibeFGEvents.WndInstID
WHERE VibeFGEventsStudyStart.excelFilename IS NOT NULL
ORDER BY VibeFGEvents.id`

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Delete duplicates from large dataset (>100Mio rows) - sql-server

Related

How to efficiently replace long strings by their index for SQL Server inserts?

SQL unique PK for grouped data in SP

How to shift entire row from last to 3rd position without changing values in SQL Server

Computed column expression

Preserving ORDER BY in SELECT INTO

Categories

Resources