Merge statement optimization - sql-server

I have a two tables in SQL Server, in which one is the source for a MERGE operation into another.
The source table has 30Mil Records
The Target table has 180Mil Records. Both tables have 227 columns.
I do have SSIS, but I'm told in this case, a MERGE statement is the better option. Below is a shortened version of it:
;WITH MySource as (
SELECT * FROM [STAGE].[dbo].[STAGE_TABLE]
)
MERGE [EDW].[dbo].[TARGET_TABLE] AS MyTarget
USING MySource
ON MySource.[ID_FIELD] = MyTarget.[ID_FIELD]
AND MySource.[LoadDate] >= MyTarget.[LoadDate]
WHEN MATCHED THEN UPDATE SET
<<Target Column>> = MySource.<<Source Colums>> --227 columns
WHEN NOT MATCHED THEN INSERT
(
[ID_FIELD],
[LoadDate],
<<225 Other Columns>>
)
VALUES (
MySource.[ID_FIELD],
MySource.[LoadDate],
MySource.<<225 other columns>>
);
The only changes I made to the script above is truncating the list of columns to keep the code block here short.
My Problem is that I am getting hung on the execution. The profile screen shows a CXPACKET suspension with the error: cwaitpipenewrow, node=2.
How do I troubleshoot this? Thank you.

Seems like CXPACKET and suspended state means that some threads which have completed are logging that other thread's state which have not completed yet.
Please check here. The query need to update upto 1 Billion values in the table. hence it would be slow running queries.
https://dba.stackexchange.com/questions/96346/cxpacket-suspended-and-null-wait-type
https://www.sqlshack.com/troubleshooting-the-cxpacket-wait-type-in-sql-server/
Hope these articles might help you debug.

Related

Streams + tasks missing inserts?

We've setup a stream on a table that is continuously loaded via snowpipe.
We're consuming this data with a task that runs every minute where we merge into another table. There is a possibility of duplicate keys so we use a ROW_NUMBER() window function, ordered by the file created timestamp descending where row_num=1. This way we always get the latest insert
Initially we used a standard task with the merge statement but we noticed that in some instances, since snowpipe does not guarantee loading in order of when the files were staged, we were updating rows with older data. As such, on the WHEN MATCHED section we added a condition so only when the file created ts > existing, to update the row
However, since we did that, reconciliation checks show that some new inserts are missing. I don't know for sure why changing the matched clause would interfere with the not matched clause.
My theory was that the extra clause added a bit of time to the task run where some runs were skipped or the next run happened almost immediately after the last one completed. The idea being that the missing rows were caught up in the middle and the offset changed before they could be consumed
As such, we changed the task to call a stored procedure which uses an explicit transaction. We did this because the docs seem to suggest that using a transaction will lock the stream. However even with this we can see that new inserts are still missing. We're talking very small numbers e.g. 8 out of 100,000s
Any ideas what might be happening?
Example task code below (not the sp version)
WAREHOUSE = TASK_WH
SCHEDULE = '1 minute'
WHEN SYSTEM$stream_has_data('my_stream')
AS
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED AND ms.file_created >= pd.file_created THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....
;
I am not fully sure what is going wrong here, but the file created time related recommendation is given by Snowflake somewhere. It suggest that the file created timestamp is calculated in cloud service and it may be bit different than you think. There is another recommendation related to snowpipe and data ingestion. The queue service takes a min to consume the data from pipe and if you have lot of data being flown inside with in a min, you may end up this issue. Look you implementation and simulate if pushing data in 1min interval solve that issue and don't rely on file create time.
The condition "AND ms.file_created >= pd.file_created" seems to be added as a mechanism to avoid updating the same row multiple times.
Alternative approach could be using IS DISTINCT FROM to compare source against target columns(except id):
MERGE INTO processed_data pd USING (
select
ms.*,
CASE WHEN ms.status IS NULL THEN 1/mv.count ELSE NULL END as pending_count,
CASE WHEN ms.status='COMPLETE' THEN 1/mv.count ELSE NULL END as completed_count
from my_stream ms
JOIN my_view mv ON mv.id = ms.id
qualify
row_number() over (
partition by
id
order by
file_created DESC
) = 1
) ms ON ms.id = pd.id
WHEN NOT MATCHED THEN INSERT (col1, col2, col3,... )
VALUES (ms.col1, ms.col2, ms.col3,...)
WHEN MATCHED
AND (pd.col1, pd.col2,..., pd.coln) IS DISTINCT FROM (ms.col1, ms.col2,..., ms.coln)
THEN UPDATE SET pd.col1 = ms.col1, pd.col2 = ms.col2, pd.col3 = ms.col3, ....;
This approach will also prevent updating row when nothing has changed.

Set values on a new table from a historical table, before adding new table back into historical table

I would like to create a historical view of alerts in an application. To do this, I am grabbing all events and timestamping them, then uploading them into a MS SQL table. I would also like to be able to exempt certain objects from the total count by flagging either the finding (to exclude the finding across all systems) or the object in the finding (to exclude the object from all findings).
The idea is, I will have all previous alerts in the main table, then I will set an 'exemptobject' or 'exemptfinding' bit column in the row. When I re-run the script weekly, I will upload the results directly into a temporary table and then I would like to compare either the object or the finding for each object in the temporary table to the main database's 'object' or 'finding' and set the respective 'exemptobject' or 'exemptfinding' bit. Once all the temporary table's objects have any exemption bits set, insert the temporary table into the main table and drop the temporary table to keep a historical record.
This will give me duplicate findings and objects, so I am having difficulty with the merge command:
BEGIN TRANSACTION
MERGE INTO [dbo].[temp_table]
USING [dbo].[historical]
ON [dbo].[temp_table].[object] = [dbo].[historical].[object] OR
[dbo].[temp_table].[finding] = [dbo].[historical].[finding]
WHEN MATCHED THEN
UPDATE
SET [exemptfinding] = [dbo].[historical].[exemptfinding]
,[exemptobject] = [dbo].[historical].[exemptobject]
,[exemptdate] = [dbo].[historical].[exemptdate]
,[comments] = [dbo].[historical].[comments];
COMMIT
This seems to do what I want, but I see that the results are going to grow exponentially and I think it won't be sustainable for long.
BEGIN TRANSACTION
UPDATE [dbo].[temp]
SET [dbo].[temp].[exemptfinding] = [historical].[exemptfinding]
,[dbo].[temp].[exemptobject] = [historical].[exemptobject]
,[dbo].[temp].[exemptdate] = [historical].[exemptdate]
,[dbo].[temp].[comments] = [historical].[comments]
FROM [dbo].[temp] temp
INNER JOIN [dbo].[historical] historical
ON (
[temp].[finding] = [sci].[finding] OR
[temp].[object] = [sci].[object] OR
) AND
(
[historical].[exemptfinding] = 1 OR
[historical].[exemptobject] = 1
)
COMMIT
I feel like I need to normalize the database, but I can't think of a way to separate things out and be able to:
See a count of each finding based on date the script was run
Be able to drill down into each day and see all the findings, objects and recommendations for each
Control the count shown for each finding by removing 'exempted' findings OR objects.
I feel like there's something obvious I'm missing or I'm thinking about this incorrectly. Any help would be greatly appreciated!
EDIT - The following seems to do what I want, but as soon as I add an additional WHERE condition to the final result, the query time goes from 7 seconds to 90 seconds, so I fear it will not scale.
BEGIN TRANSACTION
UPDATE [dbo].[temp]
SET [dbo].[temp].[exemptrecommendation] = [historical].[exemptrecommendation]
,[dbo].[temp].[exemptfinding] = [historical].[exemptfinding]
,[dbo].[temp].[exemptobject] = [historical].[exemptobject]
,[dbo].[temp].[exemptdate] = [historical].[exemptdate]
,[dbo].[temp].[comments] = [historical].[comments]
FROM (
SELECT *
FROM historical h
WHERE EXISTS (
SELECT id
,recommendation
FROM temp t
WHERE (
t.id = s.id OR
t.recommendation = s.recommendation
)
)
) historical
WHERE [dbo].[temp].[recommendation] = [historical].[recommendation] OR
[dbo].[temp].[id] = [historical].[id]
COMMIT

Missing Insert Record on the table with Trigger (Oracle DB)

The scenario is the following -
OrderTable with Columns "OrderId" and "OrderType"
OrderRelationTable with Columns "OrderId" and "CustId"
OrderProcessTable with Columns "OrderId", "OrderType", "CustId", and "ProcessFlag"
The flow goes like this-
Application1 creates the record in OrderTable -> Then pass the record to Application2 by using MQ protocol, Application 2 in this case insert/create the record passed in the OrderRelationTable -> Then a trigger is called in Oracle DB to create the record in OrderProcessTable
Problem
Sometimes the record is not inserted into table 3 OrderProcessTable. Not sure if it is cause by timing or there is something that is not correct with the trigger?
Application1 Code
boolean updated = false;
/** JDBC prepare statement execution insert into OrderTable in Java**/
int rowCount = ps.executeUpdate();
if(rowCount>0){
updated=true;
}
log.log("updated flag:"+updated);
/** I am able to see the log shows the flag is true, and recored inserted into OrderTable **/
Application2 Code
This doesn't really matter much assuming that it is some Java JDBC code that does the insert into OrderRelationTable and it is successful.
The Trigger
Assuming the syntax is correct.
CREATE OR REPLACE TRIGGER INSERTINTOOrderProcessTable
AFTER INSERT ON OrderRelationTable
FOR EACH ROW DECLEAR
v_order_type := null;
BEGIN
SELECT OrderType INTO v_order_type FROM OrderTable
WHERE OrderId = :new.OrderId
AND OrderType IS NOT NULL
AND rownum=1;
IF v_order_type IS NOT NULL THEN
INSERT INTO OrderProcessTable VALUES (:new.OrderId, v_order_type, :new.CustId, 'N');
END IF;
END;
Questions -
After the Application 1 Code is executed is guaranteed DB will have the OrderTable record avaliable for SELECT statement? (Assume that updated flag is true)
Is there a timing issue with the app code and trigger? for example when trigger calls the SELECT statement from OrderTable? (of course the order id is matching in the OrderRelationTable and OrderTable)
Basically right now my problem is that sometimes (rarely) the record is not inserted into OrderProcessTable via the trigger even though it should (Order Type is not null)? Any idea?
There's no timing issue, as far as I can tell.
As of trigger code: what is the purpose of and rownum = 1 condition? I'm not saying that it is wrong, I'm just asking. Do you expect several rows to be returned by that query? If so, is that a legal situation? Wouldn't you rather handle it with the WHEN TOO_MANY_ROWS exception handler (i.e. instead of using the ROWNUM condition)?
What happens if SELECT returns nothing? It raises then NO_DATA_FOUND exception and trigger fails and certainly doesn't insert anything. Is it propagated so that someone (human being) or something (error logging procedure) sees / catches it so that you'd know that something went wrong?
And, of course, the fact that V_ORDER_TYPE remains NULL which causes INSERT to fail (as P. Salmon already suggested).

checking if the same row exists in the table or not [duplicate]

I've got a table with data named energydata
it has just three columns
(webmeterID, DateTime, kWh)
I have a new set of updated data in a table temp_energydata.
The DateTime and the webmeterID stay the same. But the kWh values need updating from temp_energydata table.
How do I write the T-SQL for this the correct way?
Assuming you want an actual SQL Server MERGE statement:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh);
If you also want to delete records in the target that aren't in the source:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;
Because this has become a bit more popular, I feel like I should expand this answer a bit with some caveats to be aware of.
First, there are several blogs which report concurrency issues with the MERGE statement in older versions of SQL Server. I do not know if this issue has ever been addressed in later editions. Either way, this can largely be worked around by specifying the HOLDLOCK or SERIALIZABLE lock hint:
MERGE INTO dbo.energydata WITH (HOLDLOCK) AS target
[...]
You can also accomplish the same thing with more restrictive transaction isolation levels.
There are several other known issues with MERGE. (Note that since Microsoft nuked Connect and didn't link issues in the old system to issues in the new system, these older issues are hard to track down. Thanks, Microsoft!) From what I can tell, most of them are not common problems or can be worked around with the same locking hints as above, but I haven't tested them.
As it is, even though I've never had any problems with the MERGE statement myself, I always use the WITH (HOLDLOCK) hint now, and I prefer to use the statement only in the most straightforward of cases.
I often used Bacon Bits great answer as I just can not memorize the syntax.
But I usually add a CTE as an addition to make the DELETE part more useful because very often you will want to apply the merge only to a part of the target table.
WITH target as (
SELECT * FROM dbo.energydate WHERE DateTime > GETDATE()
)
MERGE INTO target WITH (HOLDLOCK)
USING dbo.temp_energydata AS source
ON target.webmeterID = source.webmeterID
AND target.DateTime = source.DateTime
WHEN MATCHED THEN
UPDATE SET target.kWh = source.kWh
WHEN NOT MATCHED BY TARGET THEN
INSERT (webmeterID, DateTime, kWh)
VALUES (source.webmeterID, source.DateTime, source.kWh)
WHEN NOT MATCHED BY SOURCE THEN
DELETE
If you need just update your records in energydata based on data in temp_energydata, assuming that temp_enerydata doesn't contain any new records, then try this:
UPDATE e SET e.kWh = t.kWh
FROM energydata e INNER JOIN
temp_energydata t ON e.webmeterID = t.webmeterID AND
e.DateTime = t.DateTime
Here is working sqlfiddle
But if temp_energydata contains new records and you need to insert it to energydata preferably with one statement then you should definitely go with the answer that Bacon Bits gave.
UPDATE ed
SET ed.kWh = ted.kWh
FROM energydata ed
INNER JOIN temp_energydata ted ON ted.webmeterID = ed.webmeterID
Update energydata set energydata.kWh = temp.kWh
where energydata.webmeterID = (select webmeterID from temp_energydata as temp)
THE CORRECT WAY IS :
UPDATE test1
INNER JOIN test2 ON (test1.id = test2.id)
SET test1.data = test2.data

Why does an UPDATE take much longer than a SELECT?

I have the following select statement that finishes almost instantly.
declare #weekending varchar(6)
set #weekending = 100103
select InvoicesCharges.orderaccnumber, Accountnumbersorders.accountnumber
from Accountnumbersorders, storeinformation, routeselecttable,InvoicesCharges, invoice
where InvoicesCharges.pubid = Accountnumbersorders.publication
and Accountnumbersorders.actype = 0
and Accountnumbersorders.valuezone = 'none'
and storeinformation.storeroutename = routeselecttable.istoreroutenumber
and storeinformation.storenumber = invoice.store_number
and InvoicesCharges.invoice_number = invoice.invoice_number
and convert(varchar(6),Invoice.bill_to,12) = #weekending
However, the equivalent update statement takes 1m40s
declare #weekending varchar(6)
set #weekending = 100103
update InvoicesCharges
set InvoicesCharges.orderaccnumber = Accountnumbersorders.accountnumber
from Accountnumbersorders, storeinformation, routeselecttable,InvoicesCharges, invoice
where InvoicesCharges.pubid = Accountnumbersorders.publication
and Accountnumbersorders.actype = 0
and dbo.Accountnumbersorders.valuezone = 'none'
and storeinformation.storeroutename = routeselecttable.istoreroutenumber
and storeinformation.storenumber = invoice.store_number
and InvoicesCharges.invoice_number = invoice.invoice_number
and convert(varchar(6),Invoice.bill_to,12) = #weekending
Even if I add:
and InvoicesCharges.orderaccnumber <> Accountnumbersorders.accountnumber
at the end of the update statement reducing the number of writes to zero, it takes the same amount of time.
Am I doing something wrong here? Why is there such a huge difference?
transaction log file writes
index updates
foreign key lookups
foreign key cascades
indexed views
computed columns
check constraints
locks
latches
lock escalation
snapshot isolation
DB mirroring
file growth
other processes reading/writing
page splits / unsuitable clustered index
forward pointer/row overflow events
poor indexes
statistics out of date
poor disk layout (eg one big RAID for everything)
Check constraints with UDFs that have table access
...
Although, the usual suspect is a trigger...
Also, your condition extra has no meaning: How does SQL Server know to ignore it? An update is still generated with most of the baggage... even the trigger will still fire. Locks must be held while rows are searched for the other conditions for example
Edited Sep 2011 and Feb 2012 with more options
The update has to lock and modify the data in the table, and also log the changes to the transaction log. The select does not have to do any of those things.
Because reading does not affect indices, triggers, and what have you?
In Slow servers or large database i usually use UPDATE DELAYED, that waits for a "break" to update the database itself.

Resources