AWS DMS MSSQL Ongoing Replication Duplicate Records


I have a task to migrate a 2-3 TB on-prem MSSQL database to an AWS RDS for MSSQL instance and would appreciate some guidance on avoiding duplicate data in the RDS instance.
The DMS tasks are split into a copy-only task and an ongoing replication task.
The ongoing replication task is started after the copy-only task is complete.
The initial data copy takes at least 10 hours to complete.
The source database uses only CDC to capture data changes; there is no SQL transactional replication.
The source database has many tables with no primary key or unique indexes.
These same tables do have non-unique indexes.
CDC capture starts before the copy-only task starts.
Take table A, which has no primary key or unique indexes. CDC starts capturing table A before the copy-only DMS task starts. The copy-only task begins copying table A 4 hours later. The ongoing replication task starts 10 hours later, when the copy-only task completes. It seems that inserts captured by CDC before the table was copied would, when replayed on the target, generate duplicate records in table A.
Is it possible to mitigate duplicate records without adding a primary key or unique indexes?
Is it safe to assume that CDC captures predicates for updates?
Thanks.
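To make the concern concrete, here is a toy Python model of the scenario. It is not DMS itself: the integer LSNs, the `cdc_log` contents, and the idea of a single consistent snapshot point are all simplifying assumptions made for illustration. It only shows why replaying inserts that the copy already contains produces duplicates on a keyless table, and why starting the replay at the snapshot point would not.

```python
from collections import Counter

def replay(target, changes, from_lsn):
    """Apply logged inserts with LSN > from_lsn to a keyless target table."""
    for lsn, row in changes:
        if lsn > from_lsn:
            target.append(row)  # no key, so nothing stops a second copy
    return target

def duplicates(rows):
    """Return rows that occur more than once, with their counts."""
    return {row: n for row, n in Counter(rows).items() if n > 1}

# Hypothetical change log for table A: (lsn, row) pairs captured by CDC.
cdc_log = [(1, ("alice", 10)), (2, ("bob", 20)), (3, ("carol", 30))]

# Assume the copy reads table A when the source is at LSN 2, so the copied
# data already contains the rows inserted at LSN 1 and 2.
snapshot_lsn = 2
snapshot = [row for lsn, row in cdc_log if lsn <= snapshot_lsn]

# Replaying from the start of CDC capture re-inserts rows 1 and 2.
naive = replay(list(snapshot), cdc_log, from_lsn=0)

# Replaying only the changes after the snapshot point does not.
aligned = replay(list(snapshot), cdc_log, from_lsn=snapshot_lsn)
```

In this model `duplicates(naive)` reports the two rows that were both copied and replayed, while `duplicates(aligned)` is empty. Whether DMS can be pinned to such a start point for a SQL Server source, and how it behaves when the copy is not a single consistent snapshot, would need to be checked against the DMS documentation.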

Related

SQL Server: capture table updates only

Question
Is there a generic (reusable across multiple tables) and easy to implement and understand way to capture only the updates made to a SQL Server table?
Objective
I have a SQL Server 2012 SP4 transactional replica. I'm loading multiple 10GB+ tables on a daily basis from SQL Server to my cloud storage.
It appears those 10GB+ tables aren't 100% immutable: 98-99% of operations are inserts, but 1-2% are updates.
Mitigations
Full table load
It requires more resources on the NiFi side, but the implementation is very simple and straightforward.
Incremental load by ID column plus capture last day updates with triggers
This is better performance-wise, but it is more complex: it requires configuring triggers on the SQL Server side plus reducing multiple versions of a row (updates) in the ETL layer.
Generate MODIFIEDAT column during replication for all tables
It would be great, but I think it is very hard to impossible: we don't have the rights to modify the master database.
CDC does not look like an option: because 98-99% of operations are inserts, CDC would almost double the table/database volume. AFAIK, from the SQL Server CDC doc and this post, SQL Server CDC can't be configured to capture updates only.
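The "reducing multiple versions of a row" step mentioned for the trigger-based option could look roughly like the sketch below. The column names (`ID`, `seq`, `status`) and the idea that each captured update carries an increasing sequence value are assumptions for illustration, not something the source tables are known to have.

```python
def latest_versions(change_rows, key="ID", order="seq"):
    """Keep only the most recent version of each row, identified by `key`."""
    latest = {}
    for row in change_rows:
        current = latest.get(row[key])
        # Keep this row if the key is new or this version is more recent.
        if current is None or row[order] > current[order]:
            latest[row[key]] = row
    return sorted(latest.values(), key=lambda r: r[key])

# Hypothetical captured updates; row 7 was updated three times in one day.
updates = [
    {"ID": 7, "seq": 1, "status": "new"},
    {"ID": 7, "seq": 3, "status": "shipped"},
    {"ID": 9, "seq": 2, "status": "new"},
    {"ID": 7, "seq": 2, "status": "paid"},
]

reduced = latest_versions(updates)
```

Here `reduced` holds one entry per ID, which is what the ETL layer would merge into the previously loaded data.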

SQL Server queries run fast against prod but not the replication DB

We are seeing a 4-minute slowdown when running reports against a SQL Server replication database that was created specifically for reporting.
Prod runs fine. Replication used to run in < 1 minute; it now takes 4 minutes.
We did two things prior to the slowdown:
Truncated the log file from 400 GB to 100 MB
Re-created the replication job after new data stopped arriving on Monday
Everything was working on Friday. From what I can see, the replication database is smaller, as we don't use all the data in prod for reports. I think it might be related to the execution plan being recreated when the new replication job was created, but that seems very odd. Any ideas?

It's likely that your replicated database doesn't have the same indexes as the primary database. Check that primary key constraints are being replicated (in the article properties), and check that indexes are being replicated.
Take a look at all the indexes and keys in the replicated database and compare them to the source database. It sounds highly likely that they're different.
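One way to do that comparison, once the index definitions have been exported from each database (for example from the `sys.indexes` catalog view), is a simple set difference. The sketch below is only illustrative: the tuple layout and the table and index names are made up.

```python
def missing_indexes(source, replica):
    """Return index definitions present on the source but absent on the replica."""
    return sorted(set(source) - set(replica))

# Hypothetical (table, index name, key columns) tuples exported from each database.
source_idx = [
    ("Orders", "PK_Orders", "OrderID"),
    ("Orders", "IX_Orders_Date", "OrderDate"),
    ("Customers", "PK_Customers", "CustomerID"),
]
replica_idx = [
    ("Orders", "PK_Orders", "OrderID"),
    ("Customers", "PK_Customers", "CustomerID"),
]

gap = missing_indexes(source_idx, replica_idx)
```

In this made-up data, `gap` contains the one index on `Orders.OrderDate` that the replica lacks, which is the kind of difference that would explain a report slowing down.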

Will SQL Server transactional replication guarantee the insert order in subscriber

We are using SQL Server 2008 and trying to use transactional replication to reduce the pressure on the DB, but we have one concern: can it guarantee the same execution order on the publisher and the subscriber?
For example, if we run the following inserts in the publisher DB
insert into Table A
insert into Table B
will these 2 inserts be executed in the same order in the subscriber DB?

The answer is yes. In transactional replication, the Log Reader Agent is responsible for reading the log file and transferring the data that is marked for replication from the publication database to the distribution database. To do this, the Log Reader Agent scans the log file. The Log Reader Agent is a continuous job, and there is no way that it will skip some transactions and transfer another transaction on the same table to the distributor.
On the other hand, the Distribution Agent transfers transactions from the distribution database to the subscriber databases. This should also happen in the order the transactions were written to the distributor, because otherwise the data would not be consistent.

Yes, rows will be inserted in the same order if both tables are in the same publication.
The Log Reader Agent will read the data from the transaction log in the order it was written and push it to the distribution database.
From there, the Distribution Agent will read the commands and apply them in the same order to the subscriber database.
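The pipeline described in these answers can be pictured as a first-in, first-out queue. The following is a toy Python model, not SQL Server's actual implementation: the record layout, the LSN values, and the function names are all assumptions for illustration.

```python
from collections import deque

def log_reader(transaction_log):
    """Forward commands marked for replication, in log (LSN) order."""
    ordered = sorted(transaction_log, key=lambda rec: rec["lsn"])
    return deque(rec for rec in ordered if rec["replicated"])

def distribution_agent(distribution_queue):
    """Apply queued commands to the subscriber in the order they were queued."""
    applied = []
    while distribution_queue:
        applied.append(distribution_queue.popleft()["command"])
    return applied

# Hypothetical publisher log; the middle record is not part of the publication.
log = [
    {"lsn": 101, "command": "insert into Table A", "replicated": True},
    {"lsn": 102, "command": "update local audit table", "replicated": False},
    {"lsn": 103, "command": "insert into Table B", "replicated": True},
]

subscriber_commands = distribution_agent(log_reader(log))
```

In this model the subscriber sees the insert into Table A before the insert into Table B, matching the order on the publisher, and the unpublished command is never forwarded.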

CDC on secondary database in SQL Server

I have a cluster of databases, one primary and two secondary. I need to enable CDC on a database, but I want to enable it on one of the secondary databases to eliminate any resource consumption on the primary database (similar to a SQL Server secondary database backup). Is this possible, and if so, how? If not, can you tell me the best practices for enabling CDC on a cluster?

I want to enable it on one of the secondary databases to eliminate any resource consumption on the primary database
This is not possible. CDC writes the changes back to system tables in the target database, so it must run against the primary replica. See Replication, change tracking, & change data capture - Always On availability groups.

Resynchronize merge replication

I have an issue where a merge replication between 2 instances, covering around 10 articles, has been dropped. I want to recreate the merge replication, and I am looking for input on the steps and the different options for setting it up again and synchronizing.
The subscriber is remote and not part of the LAN. Please note that I have the scripts to create the replication.
This is what I am thinking of doing:
Back up the current publisher and restore it to the subscriber instance under a different name
Restore a copy of the subscriber under a different name
Run a compare using a tool that generates scripts, like those from Redgate
Apply the generated script to the restored subscriber DB
After this, what do you think is the best way to get the replication running again?
Any advice appreciated. Thank you.

There are two things to check before you back up and restore.
Make sure that you have all the data from the publisher and the subscriber in one database. It could be the publisher. If you have ETLs that load your publisher and subscriber databases from different sources, this point is pretty important.
Run http://technet.microsoft.com/en-us/library/ms188734%28v=sql.105%29.aspx on both the publisher and the subscriber
Script out all your indexes if you need to reduce the backup file size. You can create them later, once you are in sync.
Back up the DB on the publisher and restore it on the subscriber
Next
Create the publication
Create the snapshot
Add the login to the access list of your publication
Add the articles to the publication
Create scripts to drop and re-create the indexes on tables classified as “big data”, to avoid snapshotting the indexes.
Do this for constraints, too. They slow the process down.
Just drop them all. From step 9
Snapshot your stuff.
Now the subscriber
Add the pull subscription. There are two steps: a script on the publisher and a script on the subscriber.
Stop the agents on the subscriber and change GENERATION_LEVELING_THRESHOLD if you need to, or change the subscriber agent profile.
You can now start the pull agents.
Remember about replication index maintenance
Hope that helps