Remove duplicates from a staging table - sql-server

I have a staging table which contains a whole series of rows of data that were taken from a data file.
Each row details a change to a row in a remote system; the rows are effectively snapshots of the source row taken after every change. Each row contains metadata timestamps for creation and updates.
I am now trying to build an update table from these data files, containing all of the updates. I need a way to remove rows with duplicate keys, keeping only the row with the latest "update" timestamp.
I am aware I can use the SSIS "sort" transform to remove duplicates by sorting on the key field and telling it to remove duplicates, but how do I ensure that the row it keeps is the one with the latest timestamp?

This will remove rows that match on Col1, Col2, etc. and have an UpdateDate that is NOT the most recent:
DELETE D
FROM MyTable AS D
JOIN MyTable AS T
ON T.Col1 = D.Col1
AND T.Col2 = D.Col2
...
AND T.UpdateDate > D.UpdateDate
If Col1 and Col2 need to be considered "matching" if they are both NULL then you would need to use:
ON (T.Col1 = D.Col1 OR (T.Col1 IS NULL AND D.Col1 IS NULL))
AND (T.Col2 = D.Col2 OR (T.Col2 IS NULL AND D.Col2 IS NULL))
...
Edit: If you need to make a case-sensitive test on a case-insensitive database then on VARCHAR and TEXT columns use:
ON (T.Col1 = D.Col1 COLLATE Latin1_General_BIN
OR (T.Col1 IS NULL AND D.Col1 IS NULL))
...
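Another common pattern for the same job (a sketch, using the same MyTable / Col1 / Col2 / UpdateDate names as above) is to number the rows within each key group with ROW_NUMBER() and delete everything that isn't ranked first:

```sql
-- Number each row within its key group, newest UpdateDate first,
-- then delete every row that is not the newest.
WITH Ranked AS
(
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY Col1, Col2
                              ORDER BY UpdateDate DESC) AS rn
    FROM MyTable
)
DELETE FROM Ranked
WHERE rn > 1;
```

One convenience of this form: PARTITION BY already groups NULLs together, so the NULL-matching rewrite of the ON clause is not needed.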

You can use the Sort Transform in SSIS to sort your data set by more than one column. Simply sort by your primary key (or ID field) followed by your timestamp column in descending order.
See the following article for more details on working with the Sort transformation:
http://msdn.microsoft.com/en-us/library/ms140182.aspx

Does it make sense to just ignore the duplicates when moving from staging to final table?
You have to do this anyway, so why not issue one query against the staging table rather than two?
INSERT final
(key, col1, col2)
SELECT
s.key, s.col1, s.col2
FROM
staging s
JOIN
(SELECT key, MAX(datetimestamp) AS maxdt
FROM staging
GROUP BY key) ms
ON s.key = ms.key AND s.datetimestamp = ms.maxdt


Is my effort necessary and is my approach creating a suitable primary key?

I am trying to create a dimension table (NewTable) from an existing Data Warehouse table (OldTable) that doesn't have a primary key.
The OldTable holds distinct values in [IdentifierCode] and other values repeat around it. I also need to invoke 3 functions to add reporting context.
I want IdentifierCode_ID to be an INT column - as the [IdentifierCode] column is VARCHAR(6).
My question is this: is using ROW_NUMBER() (as shown below) producing a suitably unique value?
My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.
Edit: OldTable has 500k rows in all and 227k when filtered with the WHERE clause
SELECT
ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
LoadDate,
StartDate,
EndDate,
Product,
IdentifierCode,
OtherField1, OtherField2, OtherField3, OtherField4,
Function1, Function2, Function3
INTO
NewTable
FROM
OldTable
WHERE
GETDATE() BETWEEN StartDate AND EndDate
First, unless you're either loading data once and never touching it again, or truncating NewTable before each load of a new date range, your approach will not work: ROW_NUMBER will restart at 1 and violate the primary key.
Even if you ARE truncating the table or only loading once, there is still a better way. Designate IdentifierCode_ID as an IDENTITY column and SQL Server will take care of it for you. If the type is INT and IDENTITY is set, SQL Server automatically adds 1 to the last value when inserting a new row; you don't even have to assign it!
CREATE TABLE dbo.NewTable(
[IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
[IdentifierCode] VARCHAR(6) NOT NULL,
...
Also, make sure you consider what you'll do if you accidentally select an overlapping date range for a subsequent load, or if values in OldTable change. For example, add a restriction to the WHERE clause to exclude existing IdentifierCode values from the insert, and add a second query to update existing IdentifierCode values that have a different LoadDate, StartDate, etc.
...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)
For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.
UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )

Update Query with Order by while preventing two users updating the same row

I have a SQL Server table with an expirydate column. I want to update the rows with the nearest expirydate. Running two queries (a select then an update) won't work because two users may update the same row at the same time, so it has to be one query.
The following query:
Update Top(5) table1
Set col1 = 1
Output deleted.* Into table2
This query runs fine, but it doesn't sort by expirydate.
This query:
WITH q AS
(
SELECT TOP 5 *
FROM table1
ORDER BY expirydate
)
UPDATE table1
SET col1 = 1
OUTPUT deleted.* INTO table2
WHERE table1.id IN (SELECT id FROM q)
It works, but again I run the risk of two users updating the same row at the same time.
What options do I have to make this work?
Thanks for the help
In these types of scenarios, if you want a more optimistic concurrency approach, you need to include an ORDER BY and/or a WHERE clause to filter the rows.
In application design it is common to use SELECT TOP (#count) FROM... style queries to fill the interface, however to execute DELETE or UPDATE statements you would use the primary key to specifically identify the rows to modify.
As long as you are not executing deletes, you could use a timestamp or other date-based discriminator column to ensure that your updates only affect rows that haven't been changed since the last select.
So you could query the current time as part of the select query:
SELECT TOP 5 *, SYSDATETIMEOFFSET() as [Selected]
FROM table1
ORDER BY expirydate
Or query for the timestamp first, and add a created column to the table to track new records so you do not include them in deletes. Either way, you need to ensure that the query to select the rows will always return the same records, even if it is run tomorrow. That means you must ensure that no one can modify the expirydate column; if it can be modified, you can't use it as your primary sort or filter key.
DECLARE @selectedTimestamp DATETIMEOFFSET = (SELECT SYSDATETIMEOFFSET())
SELECT TOP 5 *, SYSDATETIMEOFFSET() as [Selected]
FROM table1
WHERE CREATED < @selectedTimestamp
ORDER BY expirydate
Then in your update, make sure you only update the rows if they have not changed since the time that we selected them, this will either require you to have setup a standard audit trigger on the table to keep created and modified columns up to date, or for you to manage it manually in your update statement:
WITH q AS
(
SELECT TOP 5 *
FROM table1
WHERE CREATED < @selectedTimestamp
ORDER BY expirydate
)
UPDATE table1
SET col1 = 1, MODIFIED = SYSDatetimeoffset()
OUTPUT deleted.* INTO table2
WHERE table1.id IN (SELECT id FROM q)
AND MODIFIED < @selectedTimestamp
In this way we are effectively ignoring our change if another user has already updated records that were in the same or similar initial selection range.
Ultimately you could combine my initial advice to UPDATE based on the primary key AND the modified dates if you are genuinely concerned about the rows being updated twice.
If you need a more pessimistic approach, you could lock the rows with a specific user based flag so that other users cannot even select those rows, but that requires a much more detailed explanation.
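One widely used way to get that pessimistic behaviour without a user-based flag (a sketch, reusing the table1/table2/col1/expirydate names from the question) is to let the engine hand each session its own rows via locking hints: UPDLOCK locks the selected rows and READPAST makes other sessions skip rows that are already locked, so two users cannot claim the same five rows:

```sql
-- Select-and-update in one atomic statement. UPDLOCK holds the rows,
-- READPAST lets a concurrent session skip them and take the next five.
WITH q AS
(
    SELECT TOP 5 *
    FROM table1 WITH (UPDLOCK, READPAST, ROWLOCK)
    ORDER BY expirydate
)
UPDATE q
SET col1 = 1
OUTPUT deleted.* INTO table2;
```

Updating through the CTE keeps the TOP, the ORDER BY, and the lock acquisition inside a single statement, which is what removes the race between the select and the update.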

How to insert multiple rows in a merge?

How to insert multiple rows in a merge in SQL?
I'm using a MERGE INSERT and I'm wondering: is it possible to add two rows at the same time? Below is the query I have written; I want to insert both boolean values for IsNew, i.e. when a row is not matched, I want to add one row with IsNew = 1 and one with IsNew = 0.
How can I achieve this?
MERGE ITEMS AS T
USING #table AS S
ON T.[ID] = S.ID
WHEN MATCHED THEN
UPDATE SET
T.[Content] = S.[Content]
WHEN NOT MATCHED THEN
INSERT (ID, Content, TIME, IsNew)
VALUES (S.ID, S.TEXT, GETDATE(), 1);
You can't do this directly with a merge statement, but there is a simple solution.
The merge statement's <merge_not_matched> clause (the INSERT...VALUES | DEFAULT VALUES clause) can only insert one row into the target table for each row in the source table.
This means that for you to insert two rows for each unmatched row, you simply need to change the source table - in this case, it's as simple as a cross join query.
However the <merge_matched> clause requires that only a single row from the source can match any single row from the target - or you will get the following error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
To solve this problem you will have to add a condition to the WHEN MATCHED clause to make sure only one row from the source table updates the target table:
MERGE Items AS T
USING (
SELECT Id, Text, GetDate() As Date, IsNew
FROM #table
-- adding one row for each row in source
CROSS JOIN (SELECT 0 As IsNew UNION SELECT 1) AS isNewMultiplier
) AS S
ON T.[ID]=S.ID
WHEN MATCHED AND S.IsNew = 1 THEN -- Note the added condition here
UPDATE SET
T.[Content] = S.[Text]
WHEN NOT MATCHED THEN
INSERT (Id, Content, Time, IsNew) VALUES
(Id, Text, Date, IsNew);
You can see a live demo on rextester.
With all that being said, I would like to refer you to another Stack Overflow post that offers a better alternative than using the merge statement.
The author of that answer is a Microsoft MVP and an SQL Server expert DBA; you should at least read what he has to say.
It seems you can't achieve this using a merge statement. It may be better for you to split the two into separate queries for update and insert.
For example:
UPDATE ITEMS SET ITEMS.ID = #table.ID FROM ITEMS INNER JOIN #table ON ITEMS.ID = #table.ID
INSERT INTO ITEMS (ID, Content, TIME, IsNew) SELECT ID, TEXT, GETDATE(), 1 FROM #table
INSERT INTO ITEMS (ID, Content, TIME, IsNew) SELECT ID, TEXT, GETDATE(), 0 FROM #table
This will insert both rows as desired, mimicking your merge statement. However, your update statement won't do much - if you're matching based on ID, then it's impossible for you to have any IDs to update. If you wanted to update other fields, then you could change it like this:
UPDATE ITEMS SET ITEMS.Content = #table.TEXT FROM ITEMS INNER JOIN #table ON ITEMS.ID = #table.ID

How can I create duplicate rows for each row in a table that does not contain a specific value?

Keep in mind, this IS for homework. I've been stuck on this problem for at least a week now.
I need to add a row to the table Vendors for each vendor (each has a VendorID, and VendorName) that does not have a VendorState value of CA.
I'm not quite grasping how to exclude rows with a specific value, but I suspect a sub-query is involved.
Any help is much appreciated.
Edit - here is the question word for word:
Write an INSERT statement that adds a row to the VendorCopy table for
each non-California vendor in the Vendors table. (This will result in
duplicate vendors in the VendorCopy table.)
You can use a cursor to retrieve each row of data from a select query.
If I have understood your question correctly, you can use this script to insert new rows for the specific condition:
insert into Vendors
SELECT 'new val' col1,'new val' col2, VendorState FROM Vendors
where VendorState <> 'ca'
------------Edit---------------------
if you want to create a new table (a copy of Vendors) you can use this script:
SELECT * into Vendors_Copy FROM Vendors
WHERE VendorState <> 'ca'
You just need this query:
INSERT INTO table_vendors (VendorName, VendorState)
SELECT VendorName, VendorState FROM table_vendors WHERE VendorState != 'CA'
Keep in mind that if VendorId is a primary key or has a unique index, you need this field to auto-increment. You also need to specify all fields except VendorId, because you can't duplicate the ID.
Something like this, maybe:
Insert into some_Other (col1, col2 ...)
SELECT v.VendorName, o.State
FROM table_vendors v
CROSS APPLY table_state o
where v.vendor_id = o.vendorid and v.vendor_state != 'CA'

how to get multiple sets of distinct values

This is not about distinct combinations of values (Select distinct col1, col2 from table)
I have a table with a newly loaded csv file.
Some columns are linked to foreign key dimensions but the values in a given column may not exist in the reference tables.
My desire is to find all the values in each column that do not exist in the reference tables, in a way that minimizes the number of table scans on our source table.
My current approach consumes the output of a bunch of queries like the following:
SELECT DISTINCT col2 FROM table WHERE col2 NOT IN (SELECT val FROM DimCol2)
SELECT DISTINCT col3 FROM table WHERE col3 NOT IN (SELECT val FROM DimCol3)
however, for N columns, this results in N table scans.
Table is up to 10M rows and columns range in cardinality from 5 through to 5M, but almost all values are already present in the dim tables (>99%).
DimColN ranges in size from 5 values to 50M values, and is well indexed.
The csv is loaded into the table via SSIS, so pre-processing inside SSIS is possible, but I would have to avoid a SQL query for each row.
The SSIS server does not have enough spare RAM to cache all the dim tables.
What about using a LEFT JOIN and checking where the result of the join is NULL, meaning the value doesn't exist in DimCol2:
SELECT DISTINCT a.Col2
FROM table a
LEFT JOIN DimCol2 b ON a.Col2 = b.val
WHERE b.val IS NULL
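If the per-column versions of that query still mean N scans of the big table, one way to get down to a single scan of it (a sketch - the #pairs temp table name is mine, and I'm assuming the staging table, col2/col3, and DimCol2/DimCol3 names from the question) is to unpivot the columns into (name, value) pairs once, then probe each dimension against the much smaller distinct list:

```sql
-- One scan of the staging table produces every (column, value) pair.
-- [table] is bracketed because "table" is a reserved word.
SELECT DISTINCT u.col_name, u.col_value
INTO #pairs
FROM [table] t
CROSS APPLY (VALUES ('col2', t.col2),
                    ('col3', t.col3)) AS u(col_name, col_value);

-- Each dimension is then probed against the small distinct list,
-- not the 10M-row staging table: here, col2 values missing from DimCol2.
SELECT p.col_value
FROM #pairs p
LEFT JOIN DimCol2 d ON p.col_value = d.val
WHERE p.col_name = 'col2' AND d.val IS NULL;
```

One caveat: the VALUES clause requires the unpivoted columns to share a comparable data type, so you may need to CAST them to a common type (e.g. NVARCHAR) inside the VALUES list.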
