Getting Alert that Backup Log Failed but it didn't - sql-server

I'm migrating databases from SQL Server 2008 R2 to a new server running SQL Server 2012. I set up an alert for any severity >= 16. I have a maintenance plan that includes a log backup of all user databases every 5 minutes. After restoring about 10 databases to the new server, I started getting an alert every 30 minutes that says:
DESCRIPTION: BACKUP failed to complete the command BACKUP LOG MyDatabaseName. Check the backup application log for detailed messages.
COMMENT: (None)
JOB RUN: (None)
I searched the logs and there is nothing about a failed backup, and all the backups are fine. I get the alert every 30 minutes, so it's not happening on all of the log backups because they run every 5 minutes. And it's only for one or sometimes two databases out of the 10 that have been restored onto the new server.
I would greatly appreciate anyone that can point me in the right direction to start troubleshooting this.
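One way to double-check that the log backups really are completing is to look at the backup history in msdb. The query below is a minimal sketch; it assumes the database name from the alert text (MyDatabaseName) and that backup history has not been purged. Only backups that finished are recorded in backupset, so a gap in the five-minute cadence would hint at a run that did not complete.

```sql
-- List the most recent transaction log backups recorded for one database.
-- N'MyDatabaseName' is the placeholder name taken from the alert text.
SELECT TOP (20)
    bs.database_name,
    bs.backup_start_date,
    bs.backup_finish_date,
    DATEDIFF(SECOND, bs.backup_start_date, bs.backup_finish_date) AS duration_s
FROM msdb.dbo.backupset AS bs
WHERE bs.type = 'L'                          -- 'L' = transaction log backup
  AND bs.database_name = N'MyDatabaseName'
ORDER BY bs.backup_finish_date DESC;
```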

The maintenance plan runs via a SQL Server Agent job. Check the history of the job. Any failures might show there.
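As a rough sketch, the job history can also be read directly from msdb instead of through the SSMS dialog. This assumes you have read access to msdb; the filter simply lists every step outcome that was not a success.

```sql
-- Show recent SQL Server Agent job steps that did not succeed.
SELECT TOP (50)
    j.name AS job_name,
    h.step_id,
    h.step_name,
    h.run_date,        -- integer in yyyymmdd form
    h.run_time,        -- integer in hhmmss form
    h.run_status,      -- 0 = failed, 1 = succeeded, 2 = retry, 3 = canceled
    h.sql_severity,
    h.message
FROM msdb.dbo.sysjobhistory AS h
JOIN msdb.dbo.sysjobs AS j
    ON j.job_id = h.job_id
WHERE h.run_status <> 1
ORDER BY h.instance_id DESC;
```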
Severity 16 is not considered critical; it indicates an error that the user can correct.
Just set up the following to monitor all alerts > level 11.
1 - Database mail
http://craftydba.com/?p=1025
2 - Operator
http://craftydba.com/?p=1085
3 - Alerts
http://craftydba.com/?p=1099
Next time you get an alert, you should get an email with details.
If you want to be real fancy, you can have the alert call a job. Log the alert in the APPLICATION log and then send the email.
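The alert and notification steps above can be sketched in T-SQL roughly as follows. The alert name and the operator name (DBA Team) are hypothetical, and the operator and Database Mail are assumed to exist already. Including the event description in the email is what gives you the error details next time.

```sql
USE msdb;
GO

-- Create an alert that fires on any severity 16 error.
-- The alert name is a made-up example.
EXEC dbo.sp_add_alert
    @name = N'Severity 16 errors',
    @severity = 16,
    @enabled = 1,
    @delay_between_responses = 60,      -- seconds between repeated responses
    @include_event_description_in = 1;  -- 1 = put the error text in the email
GO

-- Notify an existing operator by email when the alert fires.
-- N'DBA Team' is a hypothetical operator name.
EXEC dbo.sp_add_notification
    @alert_name = N'Severity 16 errors',
    @operator_name = N'DBA Team',
    @notification_method = 1;           -- 1 = email
GO
```

One alert is needed per severity level, so the same pair of calls would be repeated for each level you want to monitor.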

Related

SQL Server pretends to do log shipping to itself

I have a sql server 2017 config:
2 nodes on a lan synced in an AG: zsbe-eve-db-01/02,
1 node in a remote site: zsbe-rui-db-01 (standby/readonly),
one database named "Primary" (it is stupid, I know, I didn't create it).
Master node zsbe-eve-db-02 is using log shipping (every minute) to send deltas to zsbe-rui-db-01.
Log shipping works: adding 1 row in db in zsbe-eve-db-02, and after 1 minute the row can be selected on zsbe-rui-db-01.
Problem:
In SQL server log of zsbe-eve-db-02, we get these 2 errors every minute:
The log shipping secondary database ZSBE-EVE-DB-02.Primary has restore threshold of 60 minutes and is out of sync. No restore was performed for 12279 minutes. Restored latency is 0 minutes. Check agent log and log shipping monitor information.
Error: 14421, Severity: 16, State: 1.
It seems that zsbe-eve-db-02 thinks it holds a secondary database (although it is the master).
ZSBE-EVE-DB-02 is primary and read write:
In log shipping config screen of zsbe-eve-db-02 only 1 rui node appears:
Can someone explain where SQL Server configured some kind of log shipping from a node to itself?
Transaction log shipping report is weird:
No information appears on zsbe-rui-db-01.
We dropped the whole log shipping config, including all related jobs and alerts, on all servers, and these messages are still there. The log shipping report shows only the line in red (alert) and we don't know how to clear it:
Any hint?
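A possible starting point, offered as a sketch rather than a confirmed fix: the log shipping report and the out-of-sync alert read from monitor tables in msdb, so a leftover row there could explain an alert that survives dropping the configuration. The queries below only inspect those tables on zsbe-eve-db-02; whether a stale row is really the cause here is an assumption.

```sql
-- Rows describing databases this instance is monitoring as log shipping secondaries.
SELECT
    secondary_server,
    secondary_database,
    primary_server,
    primary_database,
    restore_threshold,          -- minutes allowed between restores before alerting
    threshold_alert_enabled,
    last_restored_date
FROM msdb.dbo.log_shipping_monitor_secondary;

-- Rows describing databases this instance is monitoring as log shipping primaries.
SELECT
    primary_server,
    primary_database,
    backup_threshold,
    threshold_alert_enabled,
    last_backup_date
FROM msdb.dbo.log_shipping_monitor_primary;
```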

SQL Server Managed Backup for Windows Azure (SSMBackup2WA) stuck waiting for progress update

I have a database running on an azure vm with sql server. The db is in full recovery mode. The backup is configured through the web interface. Database and log backups have been working flawlessly for years. But recently the log backup was interrupted halfway through and the log backup process somehow got stuck. The following event has been logged every 5 minutes since then (reading log with managed_backup.sp_get_backup_diagnostics):
[SSMBackup2WAAdminXevent] Database Name = DB, Database ID = 777, Stage =
VerifyJobOutcome, Error Code = 0, Error Message = Warning, Additional Info = A
progress update hasn't been received from SQL Server in more than 30 minutes
for log backup. SSMBackup2WA will continue to wait.
SSMBackup2WA seems to be stuck waiting for a progress update that is never received. This has resulted in no log backups being taken. The database backups have continued running without problems.
I have trouble finding the job/task used by SSMBackup2WA. I understand it's not in the usual batch of SQL Server Agent jobs but is somehow hidden.
My idea is to somehow cancel the existing job that is stuck in waiting loop but I have not figured out how.
I have tried to "reset" the backup process by turning off the backup and then turning it on again but that did not help.
I have no possibility to restart the sql server (and I don't know if that would help).
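One thing that can be checked without a restart is whether a backup request is still executing inside the instance. This is only a diagnostic sketch; it does not show the hidden managed backup task itself, and it assumes VIEW SERVER STATE permission.

```sql
-- List backup commands that are currently executing on the instance.
SELECT
    r.session_id,
    r.command,                            -- e.g. BACKUP LOG or BACKUP DATABASE
    DB_NAME(r.database_id) AS database_name,
    r.status,
    r.start_time,
    r.percent_complete,
    r.wait_type,
    r.wait_time                           -- milliseconds spent in the current wait
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE N'BACKUP%';
```

If nothing is returned, the stuck state is more likely in the external backup agent than in a running BACKUP statement.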
So since no one seemed to have an answer to this one I resorted to restarting the SQL-server. And after the restart the transaction log backup started working again!
What is interesting is the following entry that appeared in the application event log during the restart. It does seem like there was a thread hanging indefinitely, waiting for a status update that never arrived. The restart seems to have taken care of it by killing this status thread and not restarting it in the erroneous state it had ended up in.
Log Name: Application
Source: Microsoft SQL Server Automated Backup
Date: 1/15/2022 11:16:20 AM
Event ID: 57007
Task Category: None
Level: Warning
Keywords: Classic
User: N/A
Computer: wn-sqlserver1
Description:
[Warning] AutomatedBackupStatusMonitorError:
System.Exception:
Error in auto-backup status monitor thread --->
Microsoft.SqlServer.Management.IaaSAgentSqlQuery.Contract.IaaSAgentSqlQueryException:
A network-related or instance-specific error occurred while
establishing a connection to SQL Server. The server was not
found or was not accessible. Verify that the instance name
is correct and that SQL Server is configured to allow remote
connections. (provider: Named Pipes Provider, error: 40 - Could
not open a connection to SQL Server) --->

SQL Server Trace Files Filling Up Agent Drive

Background:
SQL Compliance Manager collects trace files on an Agent Server for auditing. Once the trace files accumulate on the Agent, the Compliance Manager agent service account moves them to the Collection Server folder, processes them and deletes them.
Problem:
More than 5 times in the last month, the trace files have started filling up the Agent drive to the point where the traces have to be stopped by running a SQL query to change their status. This has also had a knock-on effect on the Collection Server: the folder there starts to fill up excessively and the Collection Server Agent is unable to process the audit trace files. On 4 of 5 occasions the issue occurred shortly after a SQL failover; however, the last time this trace error occurred there had been no failover. The only noticeable thing in the event logs was that 3 SQL jobs ran around the time the traces started acting up.
Behaviour:
A pattern has been identified which shows on Windows Event Viewer that there is an execution timeout close or at the time the trace files start becoming unwieldy.
Error: An error occurred starting traces for instance XXXXXXXXX. Error: Execution Timeout Expired.
The timeout period elapsed prior to completion of the operation or the server is not responding..
The trace start timeout value can be modified on the Trace Options tab of the Agent Properties dialog in the SQLcompliance Management Console.
However, I do not believe that just adjusting the timeout settings will stop the traces from behaving this way, as these are the recommended settings and other audited servers have the same settings without showing this behaviour. The issue only occurs on one box.
Questions:
I want to find out if anyone else has experienced a similar issue and if so, was the environment the issue happened in dealing with a heavy load? By reducing the load did it help or were there other remediation steps to take? Or does anyone know of a database auditing tool which is lightweight and doesn't create these issues?!
Any help or advice appreciated!
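For reference, the "SQL query to change the status of the traces" mentioned above might look something like the sketch below. The trace id 2 is a hypothetical value; the real id has to be taken from sys.traces first.

```sql
-- List server-side traces other than the default trace, with their output files.
SELECT
    id,
    status,        -- 0 = stopped, 1 = running
    path,
    max_size,      -- maximum trace file size in MB
    start_time,
    event_count
FROM sys.traces
WHERE is_default = 0;

-- Stop one trace. Replace 2 with an id returned by the query above.
EXEC sp_trace_setstatus @traceid = 2, @status = 0;   -- 0 = stop the trace
```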

Replication: Cannot execute 'sp_replcmds' on <ServerName>, all simple solutions already explored

We ran out of space on our Production Server and during this time we started getting: "Cannot execute 'sp_replcmds' on " on Replication. The Distributor is the Publisher as well.
After fixing the space issue, this is the only error I'm getting on my Replication.
We have five databases set-up for Replication. The four small databases work with no error messages except that the Last Synchronization Status says the following: "The process could not connect to Distributor "
The one large database gets the error in the subject and also that it cannot connect to the Distributor . The Error Code is: MSSQL_REPL22037
I checked the DBOwner and it is set up correctly. I stopped and started the Log Reader Agents too many times to count. I restarted the MSSQLServer Agent Processes on the Subscriber Server as well.
I solved this one myself. After trying all the other suggestions, it was definitely the BatchSize and the QueryTimeOut properties.
In order to change this:
Launch Replication Monitor.
Expand to the Publication in question.
Go to Agents Tab.
Right Click on Log Reader Agent > Agent Profile.
Create a New Agent Profile with the new parameters you need.
Set the New Profile to 'Use for this Agent'
Restart the Log Reader Agent and just wait.
Rinse/Repeat until you get the right amount.
I set the Timeout to 2400 and the BatchSize to 100 from 1800 and 500 respectively.
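Before creating a new profile through Replication Monitor, the existing Log Reader Agent profiles and their parameter values can be inspected with T-SQL at the Distributor. This is a sketch only; the profile id 2 is hypothetical and should be replaced with an id returned by the first call.

```sql
-- List the agent profiles defined for the Log Reader Agent (agent_type 2).
EXEC sp_help_agent_profile @agent_type = 2;

-- Show the parameters of one profile, including its timeout and batch size settings.
-- 2 is a hypothetical profile id.
EXEC sp_help_agent_parameter @profile_id = 2;
```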

merge replication - can't create snapshot - timeout - sql server 2008

I have a SQL Server 2008 database, and I need a merge replication because I want to sync with mobile devices afterwards.
So I created a replication, but when it comes to starting the snapshot agent, the agent tries to start for about 20 minutes and then shows the message
The replication agent has not logged a progress message in 10 minutes.
This might indicate an unresponsive agent or high system activity.
Verify that records are being replicated to the destination and that
connections to the Subscriber, Publisher, and Distributor are still
active.
There aren't any other error messages, neither in the snapshot agent status window nor in the agent log window.
I don't have the domain administrator account, but I do have the local administrator and a domain user with admin privileges. Both have full rights to the database and are in the access list of the replication.
The server agent runs under the local administrator account, and there are 3 merge replications on the server that are working.
The job also runs under the local administrator.
Thank you for your help, Karl
So it works again...
Maybe someone else will run into the same issue one day, so I'm posting the solution here:
I investigated on the server and found out that the SQL Server service is running under a local user. The reason for this is that there were problems with the backup system used by our customers, so they changed it years ago.
Because of the local user account, error 15404 occurs.
Knowing that I mustn't use domain accounts, I also solved the initial problem with my snapshot agent. I searched for hours (nearly days ;) ) and it was just this little change:
When the replication is created, the job is created too. The job has three steps. The job owner is the local admin, which is also the account for the server agent service. But the second step of my job (replictionsnapshot) has one setting: run as. By default this isn't the job owner but the user who created the replication, in my case my domain account.
Now that I have set it to the local administrator as well, everything works fine again.
Thanks, Karl
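To see which account a snapshot job step runs as without clicking through each job, something like the following sketch could be used. It assumes read access to msdb. A NULL proxy_name means the step runs under the SQL Server Agent service account rather than a proxy.

```sql
-- Show the owner and the "run as" proxy for every snapshot agent job step.
SELECT
    j.name                   AS job_name,
    SUSER_SNAME(j.owner_sid) AS job_owner,
    s.step_id,
    s.step_name,
    s.subsystem,
    p.name                   AS proxy_name   -- NULL = Agent service account
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s
    ON s.job_id = j.job_id
LEFT JOIN msdb.dbo.sysproxies AS p
    ON p.proxy_id = s.proxy_id
WHERE s.subsystem = N'Snapshot';
```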
I had the same issue, and the fix below resolved it. The replication agent was timing out after 10 minutes, and changing the heartbeat interval from 10 to 30 minutes solved the issue.
Run the command below:
exec sp_changedistributor_property @property = 'heartbeat_interval', @value = 30;
and then restart the SQL Server Agent on the subscriber to continue syncing.
