DMS is failing sometimes and it's not reading the data in the next cycle - CDC

Sometimes the CDC DMS task fails with the error 'Failure in resolving stream position by timestamp'.
I have a DMS CDC task that runs every 4 hours. DMS is able to read the transaction log and place the file in S3, but some runs fail with the error 'Failure in resolving stream position by timestamp', and after that no data comes through in the next cycle. We have set the polling interval to 24 hours.

Related

How to resolve a DMS 'failed to access LSN' issue in SQL Server?

I am trying to use DMS to capture change logs from SQL Server and write them to S3. I have set up a long polling period of 6 hours (AWS recommends > 1 hour). DMS fails with the error below when the database is idle for a few hours during the night.
DMS Error:
Last Error AlwaysOn BACKUP-ed data is not available Task error notification received from subtask 0, thread 0
Error from CloudWatch - Failed to access LSN '000033fc:00005314:01e6' in the backup log sets since BACKUP/LOG-s are not available
I am currently using DMS version 3.4.6 with Multi-AZ.
I always thought DMS reads the change data immediately after the transaction log is updated with the DML changes. Why do we see this error even with a long polling period? Can someone explain what causes this issue and how we can handle it?
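One way to investigate is to check whether recent transaction log backups still cover the LSN that DMS is requesting (a diagnostic sketch, not from the original thread; the database name is a placeholder):

    -- List recent log backups and the LSN range each one covers.
    -- If the LSN DMS needs falls outside these ranges (e.g. backups were
    -- pruned or are unreadable to DMS), this kind of error can appear.
    SELECT TOP (20)
           database_name,
           backup_start_date,
           first_lsn,
           last_lsn,
           type              -- 'L' = transaction log backup
    FROM msdb.dbo.backupset
    WHERE database_name = 'YourSourceDb'  -- placeholder name
      AND type = 'L'
    ORDER BY backup_start_date DESC;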

How to resolve ERROR: could not start WAL streaming: ERROR: replication slot "xxx" is active for PID 124563

PROBLEM!!
After setting up my logical replication, with everything running smoothly, I wanted to dig into the logs just to confirm there were no errors there. But when I ran tail -f postgresql.log, I found the following error kept recurring: ERROR: could not start WAL streaming: ERROR: replication slot "sub" is active for PID 124898
SOLUTION!!
This is the simple solution: I went into my postgresql.conf file and searched for wal_sender_timeout on the master and wal_receiver_timeout on the slave. The values I saw there were 120s for both, and I changed both to 300s, which is equivalent to 5 minutes. Then remember to reload both servers; a restart is not required. Wait about 5 to 10 minutes and the error is fixed.
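For anyone who prefers not to edit postgresql.conf by hand, the same change can be made from SQL (a minimal sketch, assuming superuser access on both servers):

    -- On the publisher (master): give the WAL sender more time before dropping the connection.
    ALTER SYSTEM SET wal_sender_timeout = '300s';

    -- On the subscriber (slave): match it on the receiving side.
    ALTER SYSTEM SET wal_receiver_timeout = '300s';

    -- Reload the configuration on each server; no restart is needed for these settings.
    SELECT pg_reload_conf();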
We had an identical error message in our logs and tried this fix, but unfortunately our case was much more diabolical. Putting the notes here for the next poor soul. In our case, the publishing instance was an AWS managed RDS server, and it managed (ha ha) to create such a WAL backlog that it kept going into catchup state, processing the WAL, and running out of memory (getting killed by the OS every time) before it caught up. The experience on the client side was exactly what you see here: timeouts and failed WAL streaming. The fix was kind of nasty: we had to drop the whole replication link and rebuild it (fortunately it was a test database, so no harm done, but it's a situation you want to avoid). It was obvious after looking at the logs on the publisher side, but from the subscription side it was more mysterious.
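For reference, rebuilding the link on the subscriber amounted to something like the following (a sketch; the subscription, publication, and connection details are placeholders, not our actual ones):

    -- Drop the broken subscription; by default this also drops the
    -- replication slot on the publisher, releasing the WAL backlog.
    DROP SUBSCRIPTION sub;

    -- Recreate it from scratch; tables are re-synchronized by the initial copy.
    CREATE SUBSCRIPTION sub
        CONNECTION 'host=publisher.example.com dbname=appdb user=repl_user'
        PUBLICATION pub;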

SQL Server Trace Files Filling Up Agent Drive

Background:
SQL Compliance Manager collects trace files on an Agent server for auditing. Once the trace files accumulate on the Agent, the Compliance Manager agent service account moves them to the Collection Server folder, processes them, and deletes them.
Problem:
More than 5 times in the last month, the trace files have started filling up the Agent drive to the point where the traces have to be stopped by running a SQL query that changes their status (see the sketch below). This has also had a knock-on effect on the Collection Server: the folder there starts to fill up excessively, and the Collection Server agent is unable to process the audit trace files. Four of the five times, the issue occurred shortly after a SQL Server failover; however, the last time this trace error occurred there had been no failover. The only thing noticeable in the event logs was that three SQL jobs ran around the time the traces started acting up.
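For reference, stopping the runaway traces amounts to something like this (a sketch; the exact query we run isn't shown here, and the trace id differs per server):

    -- Find active server-side traces and their output paths.
    SELECT id, path, status FROM sys.traces WHERE status = 1;

    -- Stop, then close and delete, the offending trace (id assumed; adjust per server).
    DECLARE @TraceId int = 2;
    EXEC sp_trace_setstatus @TraceId, 0;  -- 0 = stop the trace
    EXEC sp_trace_setstatus @TraceId, 2;  -- 2 = close and remove the trace definition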
Behaviour:
A pattern has been identified in the Windows Event Viewer: there is an execution timeout at or close to the time the trace files start becoming unwieldy.
Error: An error occurred starting traces for instance XXXXXXXXX. Error: Execution Timeout Expired.
The timeout period elapsed prior to completion of the operation or the server is not responding.
The trace start timeout value can be modified on the Trace Options tab of the Agent Properties dialog in the SQLcompliance Management Console.
However, I do not believe that simply adjusting the timeout settings will stop the traces from acting this way: these are the recommended settings, and other audited servers have the same settings but do not behave the same way. The issue persists on only one box.
Questions:
I want to find out if anyone else has experienced a similar issue, and if so, was the environment under heavy load? Did reducing the load help, or were there other remediation steps to take? Or does anyone know of a lightweight database auditing tool that doesn't create these issues?
Any help or advice appreciated!

Service Broker External Activator response takes long

I have two databases on SQL Server 2014, SourceDB and LogDB. Service Broker is enabled on SourceDB, and the Service Broker External Activator (SBEA) service is running on the server.
On SourceDB I have a TargetQueue: an insert trigger on a table (Product) sends changes to TargetQueue, and TargetQueue has an event notification that nudges my external exe client. Inside the exe client I dequeue the data via WAITFOR (RECEIVE TOP (1) ...) and log it directly to LogDB.
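For context, the event notification wiring looks roughly like this (a sketch; every name except TargetQueue is a placeholder):

    -- A notification queue/service pair that the External Activator monitors.
    CREATE QUEUE dbo.NotifyQueue;
    CREATE SERVICE NotifyService ON QUEUE dbo.NotifyQueue
        ([http://schemas.microsoft.com/SQL/Notifications/PostEventNotification]);

    -- Fire a notification whenever TargetQueue has messages to process;
    -- the External Activator then launches the external exe client.
    CREATE EVENT NOTIFICATION TargetQueueActivation
        ON QUEUE dbo.TargetQueue
        FOR QUEUE_ACTIVATION
        TO SERVICE 'NotifyService', 'current database';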
So, when I start the SBEA service and insert the very first record (or a few records) into the table (after deleting all records), TargetQueue is immediately filled, but the interval from the insertion into SourceDB until the insertion into LogDB is approximately 3-6 seconds; I guess that time is spent in the event notification path, but I'm not sure. For further insertions after this, the interval becomes about 100 ms, as seen below.
[Screenshots: timings for the first insertion ("First") and subsequent insertions ("Further")]
Why does the first insertion take so long, and why does it start taking long again after all records of the table are deleted? Why do further insertions take less time than the first?
Can I decrease the interval to under 10 ms? I can achieve almost the same structure with SQLCLR in under 10 ms, and the fastest possible response is crucial for my application. (Both structures run locally on the same SQL Server instance.)
You can streamline the process by ditching the External Activator and the event notification. Instead, have your program continuously run WAITFOR (RECEIVE ...) directly on the target queue in a loop.
Here's a sample to get you started: https://code.msdn.microsoft.com/Service-Broker-Message-e81c4316
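A minimal sketch of such a loop (the conversation handling and the logging target table are assumed for illustration, not taken from the sample):

    DECLARE @handle uniqueidentifier, @msg varbinary(max), @msgType sysname;

    WHILE 1 = 1
    BEGIN
        SET @handle = NULL;

        -- Block until a message arrives, or give up after 5 seconds.
        WAITFOR (
            RECEIVE TOP (1)
                @handle  = conversation_handle,
                @msg     = message_body,
                @msgType = message_type_name
            FROM dbo.TargetQueue
        ), TIMEOUT 5000;

        IF @handle IS NULL CONTINUE;  -- timed out; loop and wait again

        -- Log the payload to LogDB here (table name is a placeholder):
        -- INSERT INTO LogDB.dbo.ProductLog (Payload) VALUES (@msg);

        IF @msgType = N'http://schemas.microsoft.com/SQL/ServiceBroker/EndDialog'
            END CONVERSATION @handle;
    END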

MSSQL Backup Immediately Suspended [BULKOP_BACKUP_DB]

Our production database server has stopped running the backup maintenance plans...
The server has plenty of space. After googling and further investigation, it appears the backups are being halted the moment they are started.
Running "Select * from Sys.dm_Exec_requests where command = 'Backup Database'" indicates the backup is being suspended due to 'DATABASE: 5 [BULKOP_BACKUP_DB]' with wait type of 'LCK_M_U'
I have tried searching for an answer to this, but nothing seems to apply in this case.
As I stated, the server has plenty of disk space and has been restarted, yet the backups are still immediately suspended.
I'm all out of ideas and would appreciate any input into helping me fix this 'issue'.
Update: It seems it is waiting for session 77 to finish, which is another backup, but that session is in 'Killed/Rollback' state at 76% complete (which doesn't appear to be changing) with a wait time of 267328092. Trying to kill the process results in 'SPID 77: transaction rollback in progress. Estimated rollback completion: 76%. Estimated time remaining: 86124 seconds.'
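To monitor that rollback without issuing another kill, something like this can help (a diagnostic sketch):

    -- Report rollback progress for the stuck session without killing it again.
    KILL 77 WITH STATUSONLY;

    -- Cross-check what the session is waiting on while it rolls back.
    SELECT session_id, status, command, percent_complete,
           wait_type, wait_time,
           estimated_completion_time / 1000 AS est_seconds_remaining
    FROM sys.dm_exec_requests
    WHERE session_id = 77;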
Update: After installing updates and trying some fixes, it appeared fixed; however, on its second attempt it also stopped processing the backup, this time with a wait type of ASYNC_IO_COMPLETION...
Any ideas?
