Parallel execution of MERGE commands on JDBC driver is not working - snowflake-cloud-data-platform

I am running ~22 MERGE commands in parallel on the same JDBC connection. I am committing the transaction after all the parallel executions complete. I have turned on explicit transactions. Below are the MERGE commands I am using:
MERGE INTO _1 AS A USING ( select $1:Id::VARCHAR as Id, $1:modifiedUtc::NUMBER as MODIFIEDUTC, $1:VersionId::NUMBER as VersionId, $1 as DATA FROM '#EXTERNAL_AWS_STAGE/group1/' (FILE_FORMAT => JSON_FORMAT) ) AS B ON A.Id = B.Id WHEN MATCHED AND A.VersionId < B.VersionId THEN UPDATE SET A.VersionId = B.VersionId, A.MODIFIEDUTC = B.MODIFIEDUTC, A.DATA = B.DATA WHEN NOT MATCHED THEN INSERT (Id, MODIFIEDUTC, VersionId, DATA) VALUES (B.Id, B.MODIFIEDUTC, B.VersionId, B.DATA);
MERGE INTO _2 AS A USING ( select $1:Id::VARCHAR as Id, $1:modifiedUtc::NUMBER as MODIFIEDUTC, $1:VersionId::NUMBER as VersionId, $1 as DATA FROM '#EXTERNAL_AWS_STAGE/group2/' (FILE_FORMAT => JSON_FORMAT) ) AS B ON A.Id = B.Id WHEN MATCHED AND A.VersionId < B.VersionId THEN UPDATE SET A.VersionId = B.VersionId, A.MODIFIEDUTC = B.MODIFIEDUTC, A.DATA = B.DATA WHEN NOT MATCHED THEN INSERT (Id, MODIFIEDUTC, VersionId, DATA) VALUES (B.Id, B.MODIFIEDUTC, B.VersionId, B.DATA);
With three identical tables, the problem is that some tables are not getting updated while others are. It's not that one particular table is never updated: sometimes table _1 gets updated and sometimes it does not. I verified that the executeUpdate(query) method returns 1 for all the MERGE commands, indicating that one row is either updated or inserted for ALL the tables, yet select * from _1 returns 0 rows for some of them. The codebase either commits or rolls back the transaction and, in the end, closes the connection.
When I run the same MERGE commands via Snowflake Worksheets, I can see that data is updated in all the tables.
Any pointers would be greatly appreciated.
Some important observations:
Since we must have updates processed within a transaction boundary, we are calling Connection.setAutoCommit(false). If explicit transactions are not enabled (i.e., auto-commit is left on), the issues above are not observed.
We are sending queries in batches of 20 to Snowflake for execution. All 20 queries are fired concurrently using the same database connection. Every batch of 20 MERGE commands is followed by a pause of 2 seconds.
If we execute the MERGE commands sequentially, the issues are not observed. Even with a couple of active locks on the tables, the commits succeed and the data is updated in the tables.
For the tables that end up empty, Snowflake History still shows that the MERGE commands completed successfully and that the row count for every MERGE command was 1.
The SHOW TRANSACTIONS command shows an active transaction only while the transaction is active. After a successful commit, no active transaction is shown by this command.
The SHOW LOCKS command shows a variable number of locks, ranging from 1 to more than the number of tables involved.
If we check the history of the query id returned in the result set of the SHOW LOCKS command, we find that the query completed successfully.
Please provide any pointers which might help in further identifying the issue.

Are you using the latest Snowflake JDBC connector and explicitly setting the multi-statement options as per the following?
https://docs.snowflake.net/manuals/user-guide/jdbc-using.html#multi-statement-jdbc
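For reference, a minimal sketch of what that page describes, assuming the MULTI_STATEMENT_COUNT session parameter is the setting in question (0 allows a variable number of statements per call); the MERGE bodies are elided here and would be the full statements shown in the question:
-- Illustrative only: allow a variable number of statements per execute call,
-- then send the whole batch as one explicit transaction instead of firing
-- the statements concurrently on the shared connection.
ALTER SESSION SET MULTI_STATEMENT_COUNT = 0;
BEGIN;
MERGE INTO _1 AS A USING ( /* stage query for group1 */ ) AS B ON A.Id = B.Id /* ... */;
MERGE INTO _2 AS A USING ( /* stage query for group2 */ ) AS B ON A.Id = B.Id /* ... */;
COMMIT;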

Related

SSIS package with CHANGE TRACKING keeps missing records

I have an SSIS package using CHANGE TRACKING that runs every 5 minutes to perform one way synchronization on a table.
These are the DBs involved:
DestDB
SourceDB
DestDB contains a table called TableSyncVersions that is used to keep track of the most recent Sync version used to extract information from the table in SourceDB. This Sync Version is used for the next execution of the package to get the next batch of data.
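For context, a minimal sketch of how a saved sync version is typically validated before it is used for an incremental pull (the TableSyncVersions column names here are assumptions):
-- Illustrative only: check that change tracking has not been cleaned up past the saved version.
DECLARE @last_sync_version bigint =
    (SELECT SyncVersion FROM dbo.TableSyncVersions WHERE TableName = 'TABLE1');
IF @last_sync_version < CHANGE_TRACKING_MIN_VALID_VERSION(OBJECT_ID('dbo.TABLE1'))
BEGIN
    -- The saved version is too old; an incremental load would miss changes,
    -- so a full re-sync of TABLE1 is required instead.
    RAISERROR('Saved sync version is no longer valid; full reload required.', 16, 1);
END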
SourceDB has Snapshot Isolation enabled and the CT Query is being executed by an "OLE DB Source" in SSIS. The Query is as follows:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
--Using OLE DB parameters to capture the current version within the transaction
SELECT ? = CAST(CHANGE_TRACKING_CURRENT_VERSION() AS NVARCHAR)
SELECT ct.KeyColumn1
, ct.KeyColumn2
, ct.KeyColumn3
, st.Column1
, st.Column2
, st.Column3
, st.Column4
, ct.SYS_CHANGE_OPERATION
FROM TABLE1 AS st
--Using OLE DB Parameters to reference the version # saved in TableSyncVersions
RIGHT OUTER JOIN CHANGETABLE(CHANGES TABLE1, ?) AS ct
ON st.KeyColumn1 = ct.KeyColumn1
AND st.KeyColumn2 = ct.KeyColumn2
AND st.KeyColumn3 = ct.KeyColumn3
COMMIT TRAN;
Here is a screen shot of the Control Flow for this package:
At least once a day, the package misses 5-20 records even though it runs without error, and the records are missed at different times every day. Has anyone experienced anything like this with Change Tracking before?
Any help is greatly appreciated.
Thank you,
Tory Hill

Insert from select or update from select with commit every 1M records

I've already seen a dozen such questions, but most of them get answers that don't apply to my case.
First off, the database I am trying to get the data from is on a very slow network and is connected to over VPN.
I am accessing it through a database link.
I have full read/write access on my schema tables, but I don't have DBA rights, so I can't create dumps and I don't have grants for creating new tables, etc.
I've been trying to get the database locally and all is well except for one table.
It has 6.5 million records and 16 columns.
There was no problem getting 14 of them, but the remaining two are CLOBs with huge XML in them.
The data transfer is so slow it is painful.
I tried
insert based on select
insert all 14 then update the other 2
create table as
insert based on select conditional so I get only so many records and manually commit
The issue is mainly that the connection is lost before the transaction finishes (or power loss or VPN drops or random error etc) and all the GBs that have been downloaded are discarded.
As I said I tried putting conditionals so I get a few records but even this is a bit random and requires focus from me.
Something like :
Insert into TableA
Select * from TableA#DB_RemoteDB1
WHERE CREATION_DATE BETWEEN to_date('01-Jan-2016') AND to_date('31-DEC-2016')
Sometimes it works and sometimes it doesn't. After a few GBs, Toad is stuck running, but when I look at its throughput it is 0 KB/s or a few bytes/s.
What I am looking for is a loop or a cursor that can be used to get maybe 100,000 or 1,000,000 records at a time, commit them, then go for the rest until it is done.
This is a one time operation that I am doing as we need the data locally for testing - so I don't care if it is inefficient as long as the data is brought in in chunks and a commit saves me from retrieving it again.
I can already count about 15 GB of failed downloads over the last 3 days, and my local table still has 0 records as all my attempts have failed.
Server: Oracle 11g
Local: Oracle 11g
Attempted Clients: Toad/Sql Dev/dbForge Studio
Thanks.
You could do something like:
begin
loop
insert into tablea
select * from tablea#DB_RemoteDB1 a_remote
where not exists (select null from tablea where id = a_remote.id)
and rownum <= 100000; -- or whatever number makes sense for you
exit when sql%rowcount = 0;
commit;
end loop;
end;
/
This assumes that there is a primary/unique key you can use to check whether a row in the remote table already exists in the local one - in this example I've used a generic ID column, but replace that with your actual key column(s).
For each iteration of the loop it will identify rows in the remote table which do not exist in the local table - which may be slow, but you've said performance isn't a priority here - and then, via rownum, limit the number of rows being inserted to a manageable subset.
The loop then terminates when no rows are inserted, which means there are no rows left in the remote table that don't exist locally.
This should be restartable, due to the commit and where not exists check. This isn't usually a good approach - as it kind of breaks normal transaction handling - but as a one off and with your network issues/constraints it may be necessary.
Toad is right, using bulk collect would be (probably significantly) faster in general as the query isn't repeated each time around the loop:
declare
cursor l_cur is
select * from tablea#DB_RemoteDB1 a_remote
where not exists (select null from tablea where id = a_remote.id);
type t_tab is table of l_cur%rowtype;
l_tab t_tab;
begin
open l_cur;
loop
fetch l_cur bulk collect into l_tab limit 100000;
forall i in 1..l_tab.count
insert into tablea values l_tab(i);
commit;
exit when l_cur%notfound;
end loop;
close l_cur;
end;
/
This time you would change the limit 100000 to whatever number you think sensible. There is a trade-off here though, as the PL/SQL table will consume memory, so you may need to experiment a bit to pick that value - you could get errors or affect other users if it's too high. Lower is less of a problem here, except the bulk inserts become slightly less efficient.
But because you have a CLOB column (holding your XML), this won't work for you, as @BobC pointed out; the insert ... select is supported over a DB link, but the collection version will get an error from the fetch:
ORA-22992: cannot use LOB locators selected from remote tables
ORA-06512: at line 10
22992. 00000 - "cannot use LOB locators selected from remote tables"
*Cause: A remote LOB column cannot be referenced.
*Action: Remove references to LOBs in remote tables.

Linked Server Query Runs But Doesn't Finish?

June 29, 2010 - I had an uncommitted action from a previous delete statement. I committed the action and then got another error about conflicting primary IDs. I can fix that. So the moral of the story: commit your actions.
Original Question -
I'm trying to run this query:
with spd_data as (
select *
from openquery(IRPROD,'select * from budget_user.spd_data where fiscal_year = 2010')
)
insert into [IRPROD]..[BUDGET_USER].[SPD_DATA_BUD]
(REC_ID, FISCAL_YEAR, ENTITY_CODE, DIVISION_CODE, DEPTID, POSITION_NBR, EMPLID,
spd_data.NAME, JOB_CODE, PAY_GROUP_CODE, FUND_CODE, FUND_SOURCE, CLASS_CODE,
PROGRAM_CODE, FUNCTION_CODE, PROJECT_ID, ACCOUNT_CODE, SPD_ENC_AMT, SPD_EXP_AMT,
SPD_FB_ENC_AMT, SPD_FB_EXP_AMT, SPD_TUIT_ENC_AMT, SPD_TUIT_EXP_AMT,
spd_data.RUNDATE, HOME_DEPTID, BUD_ORIG_AMT, BUD_APPR_AMT)
SELECT REC_ID, FISCAL_YEAR, ENTITY_CODE, DIVISION_CODE, DEPTID, POSITION_NBR, EMPLID,
spd_data.NAME, JOB_CODE, PAY_GROUP_CODE, FUND_CODE, FUND_SOURCE, CLASS_CODE,
PROGRAM_CODE, FUNCTION_CODE, PROJECT_ID, ACCOUNT_CODE, SPD_ENC_AMT, SPD_EXP_AMT,
SPD_FB_ENC_AMT, SPD_FB_EXP_AMT, SPD_TUIT_ENC_AMT, SPD_TUIT_EXP_AMT,
spd_data.RUNDATE, HOME_DEPTID, lngOrig_amt, lngAppr_amt
from spd_data
left join Budgets.dbo.tblAllPosDep on project_id = projid
and job_code = jcc and position_nbr = psno
and emplid = empid
where OrgProjTest = 'EQUAL';
Basically I'm selecting a table from IRPROD (an oracle db), joining it with a local table, and inserting the results back on IRPROD.
The problem I'm having is that while the query runs, it never stops. I've let it run for an hour and it keeps going until I cancel it. I can see data going in and out on a bandwidth monitor on the SQL Server. Also, if I just run the select part of the query, it returns the results in 4 seconds.
Any ideas why it's not finishing? I've got other queries set up in a similar manner and do not have any problems (granted, those insert from local tables and not a remote table).
You didn't include any volume metrics, but I would recommend using a temporary table to gather the results.
Then you should try to insert the first couple of rows. If this succeeds, you'll have a strong indicator that everything is fine.
Try to break down each insert task by project_id or emplid to avoid large transaction logs.
You should also think about crafting a bulk batch process.
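For illustration, a rough sketch of the staging-plus-batching idea (the temp table, cursor, and variable names are assumptions; the column lists are abbreviated and the join/filter from the original query are omitted for brevity):
-- Illustrative only: pull the Oracle rows into a local temp table once,
-- then push them back to IRPROD one project_id at a time so each
-- distributed transaction stays small.
SELECT *
INTO #spd_stage
FROM openquery(IRPROD, 'select * from budget_user.spd_data where fiscal_year = 2010');

DECLARE @project_id varchar(30);
DECLARE proj_cur CURSOR LOCAL FAST_FORWARD FOR
    SELECT DISTINCT PROJECT_ID FROM #spd_stage;
OPEN proj_cur;
FETCH NEXT FROM proj_cur INTO @project_id;
WHILE @@FETCH_STATUS = 0
BEGIN
    INSERT INTO [IRPROD]..[BUDGET_USER].[SPD_DATA_BUD] (REC_ID, FISCAL_YEAR /* , ... full column list ... */)
    SELECT s.REC_ID, s.FISCAL_YEAR /* , ... full column list ... */
    FROM #spd_stage AS s
    WHERE s.PROJECT_ID = @project_id;

    FETCH NEXT FROM proj_cur INTO @project_id;
END
CLOSE proj_cur;
DEALLOCATE proj_cur;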
If you run just the select without the insert, how many records are returned? Does the data look right or are there multiple records due to the join?
Are there triggers on the table you are inserting into? If you are returning many records and there are triggers on the table designed to run row-by-row, this could be slowing things down. You are also sending data to another server, so the network pipeline could be what is slowing you down. Maybe it would be better to send the budget data to the Oracle server and do the insert from there rather than from the SQL Server.

Modify SQL result set before returning from stored procedure

I have a simple table in my SQL Server 2008 DB:
Tasks_Table
-id
-task_complete
-task_active
-column_1
-..
-column_N
The table stores instructions for uncompleted tasks that have to be executed by a service.
I want to be able to scale my system in the future. Until now, only one service on one computer has read from the table. I have a stored procedure that selects all uncompleted and inactive tasks. As the service begins to process tasks, it updates the task_active flag in all the returned rows.
To enable scaling of the system I want to enable deployment of the service on more machines. Because I want to prevent a task from being returned to more than one service, I have to update the stored procedure that returns uncompleted and inactive tasks.
I figured that I have to lock the table (only one reader at a time; I know I have to use an appropriate ISOLATION LEVEL) and update the task_active flag in each row of the result set before returning the result set.
So my question is: how do I modify the SELECT result set in the stored procedure before returning it?
This is the typical dequeue pattern. It is implemented using the OUTPUT clause and is described in MSDN; see the Queues paragraph in OUTPUT Clause (Transact-SQL):
UPDATE TOP(1) Tasks_Table WITH (ROWLOCK, READPAST)
SET task_active = 1
OUTPUT INSERTED.id,INSERTED.column_1, ...,INSERTED.column_N
WHERE task_active = 0;
The ROWLOCK, READPAST hints allow for high throughput and high concurrency: multiple threads/processes can enqueue new tasks while multiple threads/processes dequeue tasks. There is no order guarantee.
Updated
If you want to order the result you can use a CTE:
WITH cte AS (
SELECT TOP(1) id, task_active, column_1, ..., column_N
FROM Tasks_Table WITH (ROWLOCK, READPAST)
WHERE task_active = 0
ORDER BY <order by criteria>)
UPDATE cte
SET task_active = 1
OUTPUT INSERTED.id, INSERTED.column_1, ..., INSERTED.column_N;
I discussed this and other enqueue/dequeue techniques on the article Using Tables as Queues.
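As a usage illustration, the update-with-OUTPUT can be wrapped in a stored procedure that each service instance calls to claim one task atomically (the procedure name here is an assumption; the column placeholders follow the question):
-- Illustrative sketch: claim one inactive task and return it to the caller.
CREATE PROCEDURE dbo.usp_DequeueTask
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE TOP(1) Tasks_Table WITH (ROWLOCK, READPAST)
    SET task_active = 1
    OUTPUT INSERTED.id, INSERTED.column_1 /* , ..., INSERTED.column_N */
    WHERE task_active = 0;
END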

Explain locking behavior in SQL Server

Why is it that, with default settings on SQL Server (transaction isolation level = read committed), this test:
CREATE TABLE test2 (
ID bigint,
name varchar(20)
)
then run this in one SSMS tab:
begin transaction SH
insert into test2(ID,name) values(1,'11')
waitfor delay '00:00:30'
commit transaction SH
and this one simultaneously in another tab:
select * from test2
requires the 2nd select to wait for the first to complete before returning??
We also tried these for the 2nd query:
select * from test2 NOLOCK WHERE ID = 1
and tried inserting one ID in the first query and selecting a different ID in the second.
Is this the result of page locking? When running the two queries, I've also run this:
select object_name(P.object_id) as TableName, resource_type, resource_description
from
sys.dm_tran_locks L join sys.partitions P on L.resource_associated_entity_id = p.hobt_id
and gotten this result set:
test2 RID 1:12186:5
test2 RID 1:12186:5
test2 PAGE 1:12186
test2 PAGE 1:12186
requires the 2nd select to wait for the first to complete before returning??
Read committed prevents dirty reads, and by blocking you get a consistent result. Snapshot isolation gets around this, but you will get slightly worse performance, because SQL Server now has to keep the old row versions for the duration of the transaction (better have your tempdb on a good drive).
BTW, try changing the query from
select * from test2
to
select * from test2 where id <> 1
assuming you have more than one row in the table and the table spans more than one page; insert a couple of thousand rows to test this, as sketched below
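A quick, purely illustrative way to pad the table for that test:
-- Illustrative only: fill test2 with a few thousand rows so it spans multiple pages.
SET NOCOUNT ON;
DECLARE @i int = 2;
WHILE @i <= 5000
BEGIN
    INSERT INTO test2 (ID, name) VALUES (@i, CAST(@i AS varchar(20)));
    SET @i += 1;
END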
List traversal with node locking is done by 'crabbing':
you hold a lock on the current node
you grab a lock on the next node
you make the next node current
you release the lock on the previous node (former current)
This technique is common to all list traversal algorithms and is meant to maintain stability while traversing: you never make a 'leap' without having yourself anchored by a lock. It is often compared to the technique used by rock climbers.
A statement like SELECT ... FROM table; is a scan over the entire table. As such, it can be compared with a list traversal, and the thread doing the table scan will 'crab' over the rows just like a thread doing a list traversal crabs over the nodes. Such a list traversal is guaranteed to attempt to lock, eventually, every single node in the list, and a table scan will similarly attempt to lock, at one time or another, every single row in the table. So any conflicting lock held by another transaction on a row will block the scan, 100% guaranteed. Everything else you observe (page locks, intent locks, etc.) is implementation detail, irrelevant to the fundamental issue.
The proper solution to this problem is to optimize the queries so that they don't scan tables end-to-end. Only after that is achieved can you turn your focus to eliminating whatever contention is left: deploy snapshot-isolation-based row-level versioning. In other words, enable read-committed snapshot on the database.
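For reference, enabling read-committed snapshot is a single database-level switch (the database name below is a placeholder, and the ALTER needs momentary exclusive access to the database):
-- Illustrative: switch the default read-committed behavior to row versioning.
ALTER DATABASE MyDatabase SET READ_COMMITTED_SNAPSHOT ON;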
