I am downloading data that will contain duplicates of previously downloaded data.
I am successfully using a MERGE statement to throw away the duplicates based on transaction number. Supposedly that is sufficient, but I would like to monitor whether the detail ever changes on a particular transaction.
To that end I added a WHEN MATCHED clause to the MERGE with an additional INSERT that flags the record as a duplicate.
This logic should not trigger very often so I am not too concerned that this method (if it worked) would report the same duplicate multiple times.
When I prepare this code I get this error message:
An action of type 'INSERT' is not allowed in the 'WHEN MATCHED' clause of a MERGE statement.
Is there a way to get the duplicate record to insert into this table or another table using the MERGE statement?
I am open to other solutions, but I would really like to find a way to do this with the MERGE statement because it would impact my code the least.
MERGE INTO dbo.TransactionDetail as t
USING (SELECT @TransNr -- bigint
             ,@Detail  -- [VARCHAR](50) NOT NULL
      ) as s
      ([TranNr]
      ,[Detail]
      )
on t.TranNr = s.TranNr and t.CHANGED_RECORD = 0
when not matched then
    INSERT (CHANGED_RECORD
           ,[TranNr]
           ,[Detail]
           )
    VALUES (0, s.TranNr, s.Detail)
/* Adding this does not allow the statement to be prepared...
when matched and s.Detail <> t.Detail then
    INSERT (CHANGED_RECORD
           ,[TranNr]
           ,[Detail]
           )
    VALUES (1, s.TranNr, s.Detail)
*/
;
You can use an INSERT statement, like this:
INSERT INTO dbo.TransactionDetail (CHANGED_RECORD, TranNr, Detail)
SELECT CASE WHEN EXISTS (
           SELECT * FROM dbo.TransactionDetail
           WHERE TranNr = @TransNr AND CHANGED_RECORD = 0 AND Detail <> @Detail
       ) THEN 1 ELSE 0 END AS CHANGED_RECORD,
       @TransNr AS TranNr, @Detail AS Detail
WHERE NOT EXISTS (
    SELECT * FROM dbo.TransactionDetail
    WHERE TranNr = @TransNr AND CHANGED_RECORD = 0 AND Detail = @Detail
)
This will skip the insert if a row with CHANGED_RECORD=0 already has the same detail. However, if the same detail is only found in a row with CHANGED_RECORD=1, a new duplicate would still be inserted. To avoid that, remove the AND CHANGED_RECORD=0 condition from the WHERE NOT EXISTS subquery.
You may also want to create a unique filtered index, to ensure uniqueness for the rows that have CHANGED_RECORD=0:
CREATE UNIQUE INDEX IX_TransactionDetail_Filtered
ON TransactionDetail (TranNr) /*INCLUDE (Detail)*/ WHERE CHANGED_RECORD=0
The INCLUDE (Detail) clause could also marginally improve the performance of queries that look up the Detail of rows with CHANGED_RECORD=0 (at the expense of some additional disk space and a small performance penalty when updating the Detail column of existing rows).
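As a quick illustration of how the filtered index enforces the rule (the CREATE TABLE layout below is an assumption pieced together from the column comments in the question):
CREATE TABLE dbo.TransactionDetail (
    CHANGED_RECORD BIT NOT NULL,
    TranNr BIGINT NOT NULL,
    Detail VARCHAR(50) NOT NULL
);
CREATE UNIQUE INDEX IX_TransactionDetail_Filtered
    ON dbo.TransactionDetail (TranNr) WHERE CHANGED_RECORD = 0;
INSERT INTO dbo.TransactionDetail VALUES (0, 1, 'original'); -- succeeds
INSERT INTO dbo.TransactionDetail VALUES (1, 1, 'changed');  -- succeeds: row not covered by the filter
INSERT INTO dbo.TransactionDetail VALUES (0, 1, 'dup');      -- fails with a duplicate key error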
Related
I have a process which generates a number of potentially asymmetric outcomes (record A links to record B, but record B may not link to record A). Each of these outcomes is stored in a table, and I want to insert into that same table all the missing links (i.e. generate the reverse B-to-A row for every case where only the A-to-B row exists) - without generating duplicates.
From looking around it seems that NOT EXISTS is the preferred method for this. But as this is an INSERT into the same table, I wanted to see if anyone had ideas for a more efficient approach (table size will vary from ~50,000 to ~20,000,000).
INSERT INTO [table1]
    ([record_id], [linked_record_id], [flag_value])
SELECT A.[linked_record_id] AS [record_id]
      ,A.[record_id] AS [linked_record_id]
      ,A.[flag_value]
FROM [table1] AS A
WHERE A.[flag_value] = 1
AND NOT EXISTS (
    SELECT 1
    FROM [table1] AS B
    WHERE B.[flag_value] = 1
    AND A.[linked_record_id] = B.[record_id]
    AND A.[record_id] = B.[linked_record_id]
)
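For comparison, the same anti-join is often written as a LEFT JOIN with an IS NULL test, which the optimizer may plan differently at scale; a minimal sketch against the same assumed table layout:
INSERT INTO [table1] ([record_id], [linked_record_id], [flag_value])
SELECT A.[linked_record_id], A.[record_id], A.[flag_value]
FROM [table1] AS A
LEFT JOIN [table1] AS B
    ON  B.[record_id] = A.[linked_record_id]
    AND B.[linked_record_id] = A.[record_id]
    AND B.[flag_value] = 1
WHERE A.[flag_value] = 1
AND B.[record_id] IS NULL;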
UPDATE table1 SET Name = 'Deepak' WHERE id = 1 AND Name != 'Deepak'
Does adding the condition on the Name column improve performance, considering that id has a clustered index, Name has a nonclustered index, and there is around a 60% probability of getting a '0 rows updated' result after running the above query?
Reasons to Only Update If Different
Reduced locks
Prevents unnecessary activity if you have triggers or certain configs of SQL Server Replication
Preserve audit trail columns like LastModifiedDateTime
Only Update If Different Using EXCEPT
Most people's main complaint would probably be the extra query complexity, but I find using EXCEPT makes this process super easy. EXCEPT is ideal because it handles any data type and NULL values without issue.
UPDATE Table1
SET Col1 = @NewVal1
   ,Col2 = @NewVal2
    ....
   ,LastModifiedBy = @UserID
   ,LastModifiedDateTime = GETDATE()
WHERE EXISTS (
    SELECT Col1, Col2
    EXCEPT
    SELECT @NewVal1, @NewVal2
)
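The NULL behaviour is easy to verify in isolation: EXCEPT treats two NULLs as not distinct, so the first statement below returns no rows, while a <> comparison against NULL evaluates to UNKNOWN and also qualifies nothing:
SELECT CAST(NULL AS INT) AS Col1
EXCEPT
SELECT CAST(NULL AS INT); -- no rows: the NULLs are considered equal
SELECT 1 WHERE CAST(NULL AS INT) <> CAST(NULL AS INT); -- no rows: the comparison is UNKNOWN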
I have a merge statement that starts like this:
MERGE INTO TEMSPASA
USING (SELECT *
FROM OPENQUERY(orad, 'SELECT * FROM CDAS.TDWHCORG')) AS TDWHPASA ON TEMSPASA.pasa_cd = LTRIM(RTRIM(TDWHPASA.corg_id)) AND
TEMSPASA.pasa_active_ind = TDWHPASA.corg_active_ind
WHEN MATCHED THEN
UPDATE
SET
TEMSPASA.pasa_desc = LTRIM(RTRIM(TDWHPASA.corg_nm)),
TEMSPASA.pasa_active_ind = TDWHPASA.corg_active_ind
WHEN NOT MATCHED THEN
INSERT (pasa_cd, pasa_desc, pasa_active_ind)
VALUES (LTRIM(RTRIM(TDWHPASA.corg_id)), TDWHPASA.corg_nm, TDWHPASA.corg_active_ind);
There are pasa_cd values like ('H04', 'H04*') where that * is NOT a wildcard. But I think the ON clause is treating it like a wildcard, because when I try to run the MERGE statement, I get the following error:
The MERGE statement attempted to UPDATE or DELETE the same row more than once. This happens when a target row matches more than one source row. A MERGE statement cannot UPDATE/DELETE the same row of the target table multiple times. Refine the ON clause to ensure a target row matches at most one source row, or use the GROUP BY clause to group the source rows.
I have verified that there are no duplicates in my table. The only thing I can think of is what I mentioned above, that the ON part of the merge statement is seeing that * as a wildcard.
I have tried searching, saw something about an escape character, but that was in the where clause. Any ideas how to deal with this?
This means you have more than one row matching between the source and target tables. You need to find out which row(s) are the issue here; it could be from either table. Something like this should help you identify where the problem is coming from.
SELECT LTRIM(RTRIM(TDWHPASA.corg_id))
, TDWHPASA.corg_active_ind
FROM CDAS.TDWHCORG as TDWHPASA
group by LTRIM(RTRIM(TDWHPASA.corg_id))
, TDWHPASA.corg_active_ind
having count(*) > 1
select t.pasa_cd
, t.pasa_active_ind
from TEMSPASA t
group by t.pasa_cd
, t.pasa_active_ind
having count(*) > 1
So my theory in my initial post was incorrect. There were duplicates in my source table, but they were due to case sensitivity: there were values H04F and H04f. Those are different rows, but because my SQL collation is case-insensitive, the MERGE was seeing them as duplicates. To resolve the issue I added COLLATE Latin1_General_CS_AS to the ON clause and it did the trick.
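The poster doesn't show exactly where the collation was placed, but attached to the string comparison in the ON clause the working statement would look something like this:
MERGE INTO TEMSPASA
USING (SELECT *
       FROM OPENQUERY(orad, 'SELECT * FROM CDAS.TDWHCORG')) AS TDWHPASA
ON TEMSPASA.pasa_cd = LTRIM(RTRIM(TDWHPASA.corg_id)) COLLATE Latin1_General_CS_AS AND
   TEMSPASA.pasa_active_ind = TDWHPASA.corg_active_ind
WHEN MATCHED THEN
    UPDATE
    SET TEMSPASA.pasa_desc = LTRIM(RTRIM(TDWHPASA.corg_nm)),
        TEMSPASA.pasa_active_ind = TDWHPASA.corg_active_ind
WHEN NOT MATCHED THEN
    INSERT (pasa_cd, pasa_desc, pasa_active_ind)
    VALUES (LTRIM(RTRIM(TDWHPASA.corg_id)), TDWHPASA.corg_nm, TDWHPASA.corg_active_ind);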
I am loading data to table in the following manner:
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(*) FROM A );
INSERT INTO t
    (Col1
    ,Col2
    ,Col3
    )
SELECT A.Col1
      ,A.Col2
      ,B.Col3
FROM A
JOIN B
    ON A.Id = B.Id;
SET @dstRc = @@ROWCOUNT;
Now I am comparing the variables @srcRc and @dstRc. The row counts must be the same. If they are not, the inserted rows need to be deleted.
Q1: What would be the best strategy to roll back the inserted rows?
I have a couple of ideas:
1) Run the load in a transaction and roll back if the row counts do not match.
2) Add a flag column (bit) called toBeDeleted to the destination table, run the load, and if the row counts do not match, set toBeDeleted to 1 to mark the rows as candidates for deletion. Then delete them in batch mode (while-loop). Or do not delete them at all, but always exclude deletion candidates from queries against table t.
3) Before inserting the rows, compare the row counts first. If they do not match, don't start the load.
DECLARE @srcRc INT;
DECLARE @dstRc INT;
SET @srcRc = ( SELECT COUNT(1) FROM A );
SET @dstRc = ( SELECT COUNT(1) FROM A JOIN B ON A.Id = B.Id );
Q2: What would be a better solution for a higher number of rows, say 10-100 million?
Q3: Or is there any better strategy for similar case?
OK, assuming:
You need the rollback to work at some later date, when the content of tables A and B may have changed
There may also be other rows in T which you don't want to delete as part of the rollback
Then you MUST keep a list of the rows you inserted, as you are unable to reliably regenerate that list from A and B, and you can't just delete everything from T.
You could do this in two ways:
Change your import so that it first inserts the rows into an import table; keep the import table around until you are sure you don't need it any more.
Add an extra column [importId] to T, into which you put a value that uniquely identifies the import run (see the sketch below).
Obviously the first strategy uses a lot more disk space. So the longer you keep the data and the more data there is, the better the extra column looks.
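A minimal sketch of the importId approach, reusing the question's variables (the NEWID-based batch id and the final IF check are assumptions):
DECLARE @srcRc INT, @dstRc INT;
DECLARE @importId UNIQUEIDENTIFIER = NEWID(); -- uniquely identifies this load
SET @srcRc = ( SELECT COUNT(*) FROM A );
INSERT INTO t (Col1, Col2, Col3, importId)
SELECT A.Col1, A.Col2, B.Col3, @importId
FROM A
JOIN B ON A.Id = B.Id;
SET @dstRc = @@ROWCOUNT;
IF @dstRc <> @srcRc
    DELETE FROM t WHERE importId = @importId; -- undo only this load's rows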
Another option would be to generate the list of imported data separately and have your transactional SQL be a bulk insert with all the data hard-coded into the SQL.
This works well for small lists, initial setup data and the like.
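For illustration, such a hard-coded script might look like this (table t and the values are placeholders):
BEGIN TRANSACTION;
INSERT INTO t (Col1, Col2, Col3)
VALUES (1, 'a', 'x'),
       (2, 'b', 'y');
-- verify the results, then
COMMIT; -- or ROLLBACK;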
Edit:
From your comments it sounds like you don't want a rollback per se, but rather the best way to apply business logic around the import process.
In that case your third option is the best: don't do the import when you know the source data is incorrect.
I have an UPDATE statement in SQL Server where there are four possible values that can be assigned based on the join. It appears that SQL Server has an algorithm for choosing one value over another, and I'm not sure how that algorithm works.
As an example, say there is a table called Source with two columns (Match and Data) structured as below:
(The Match column contains only 1's; the Data column increments by 1 for every row.)
Match   Data
-----   ----
1       1
1       2
1       3
1       4
That table will update another table called Destination with the same two columns structured as below:
Match   Data
-----   ----
1       NULL
If you want to update the Data field in Destination in the following way:
UPDATE Destination
SET Data = Source.Data
FROM Destination
INNER JOIN Source
    ON Destination.Match = Source.Match
there are four possible values that Destination.Data could be set to after this query is run. I've found that changing the indexes on Source has an impact on what Destination.Data is set to, and it appears that SQL Server just updates the Destination table with the first value it finds that matches.
Is that accurate? Is it possible that SQL Server is updating the Destination with every possible value sequentially, and I end up with the same kind of result as if it were updating with the first value it finds? It seems potentially problematic that it will seemingly randomly choose one row to update, as opposed to throwing an error when presented with this situation.
It sets the Data column to each of the matching values; which one you end up with after the query depends on the order in which the results are returned (whichever one it sets last wins).
Since there's no ORDER BY clause, you're left with whatever order SQL Server comes up with. That will normally follow the physical order of the records on disk, which in turn typically follows the clustered index of the table. But this order isn't set in stone, particularly when joins are involved. If a join matches on a column with an index other than the clustered index, it may well order the results based on that index instead. In the end, unless you give it an ORDER BY clause, SQL Server will return the results in whatever order it thinks it can produce fastest.
You can play with this by turning your update query into a select query, so you can see the results. Notice which record comes first and which record comes last in the source table for each record of the destination table. Compare that with the results of your update query. Then play with your indexes again and check the results once more to see what you get.
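For instance, the update above rewritten as a plain SELECT (same tables and join as in the question):
SELECT Destination.Match, Source.Data
FROM Destination
INNER JOIN Source
    ON Destination.Match = Source.Match;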
Of course, it can be tricky here because UPDATE statements are not allowed to use an ORDER BY clause, so regardless of what you find, you should really write the join so it matches the destination table 1:1. You may find the APPLY operator useful in achieving this goal; you can use it to effectively join to another table while guaranteeing the join matches only one record.
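A minimal sketch of that APPLY approach (the ORDER BY Data DESC tiebreaker is an assumption; pick whatever rule suits your data):
UPDATE D
SET Data = S.Data
FROM Destination AS D
CROSS APPLY (
    SELECT TOP (1) Source.Data
    FROM Source
    WHERE Source.Match = D.Match
    ORDER BY Source.Data DESC -- deterministic tiebreaker
) AS S;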
The choice is not deterministic and it can be any of the source rows.
You can try
DECLARE @Source TABLE(Match INT, Data INT);
INSERT INTO @Source
VALUES
    (1, 1),
    (1, 2),
    (1, 3),
    (1, 4);
DECLARE @Destination TABLE(Match INT, Data INT);
INSERT INTO @Destination
VALUES
    (1, NULL);
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN @Source Source
    ON Destination.Match = Source.Match;
SELECT *
FROM @Destination;
And look at the actual execution plan. I see the following.
The output columns from @Destination are Bmk1000 and Match. Bmk1000 is an internal row identifier (used here due to the lack of a clustered index in this example) and would be different for each row emitted from @Destination (if there were more than one).
The single row is then joined onto the four matching rows in @Source, and the resultant four rows are passed into a stream aggregate.
The stream aggregate groups by Bmk1000 and collapses the multiple matching rows down to one. The operation performed by this aggregate is ANY(@Source.[Data]).
The ANY aggregate is an internal aggregate function not available in T-SQL itself. No guarantees are made about which of the four source rows will be chosen.
Finally the single row per group feeds into the UPDATE operator to update the row with whatever value the ANY aggregate returned.
If you want deterministic results then you can use an aggregate function yourself...
WITH GroupedSource AS
(
    SELECT Match,
           MAX(Data) AS Data
    FROM @Source
    GROUP BY Match
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN GroupedSource Source
    ON Destination.Match = Source.Match;
Or use ROW_NUMBER...
WITH RankedSource AS
(
    SELECT Match,
           Data,
           ROW_NUMBER() OVER (PARTITION BY Match ORDER BY Data DESC) AS RN
    FROM @Source
)
UPDATE Destination
SET Data = Source.Data
FROM @Destination Destination
INNER JOIN RankedSource Source
    ON Destination.Match = Source.Match
WHERE RN = 1;
The latter form is generally more useful, as in the event you need to set multiple columns it ensures that all values used come from the same source row. For the result to be deterministic, the combination of PARTITION BY and ORDER BY columns should be unique.