SQL Server 2005 Replication Stops after a minute without an error - sql-server

Background:
I have a SQL Server 2005 setup with a master and two slaves (slave1 and slave2), configured as pull replication from the slaves. The distribution database resides on the slave1 machine, and both slaves pull from it.
A problem began today where replication on slave1 simply stops running. It claims to have completed successfully, but it does not restart, and manually starting the process finishes in roughly one minute, again without an error message.
Replication is running fine on slave2, but I can't seem to figure out what's wrong on slave1. I've tried the obvious Windows debugging 101: "restart the machine" technique, but to no avail.
Has anyone encountered this before? Does anyone have an idea of what I could check or change to get it working again? I'm especially at a loss since SQL Server claims that the job is finishing successfully.
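For reference, this is roughly how I'm checking what the agent reports on slave1 (a sketch; it assumes the distribution database has the default name, distribution):

    -- On slave1, where the distribution database lives:
    -- recent runs of the pull Distribution Agents and what they reported
    SELECT TOP (20)
           a.name      AS agent_name,
           h.start_time,
           h.duration,      -- seconds
           h.runstatus,     -- 1=Start, 2=Succeed, 3=In progress, 4=Idle, 5=Retry, 6=Fail
           h.comments
    FROM distribution.dbo.MSdistribution_history AS h
    JOIN distribution.dbo.MSdistribution_agents  AS a ON a.id = h.agent_id
    ORDER BY h.start_time DESC;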

I'm still unsure why this began occurring, but it appears to be due to the use of a custom SQL Server Replication Agent profile. Switching back to the default profile got replication working again.
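For anyone hitting the same thing, a minimal sketch of how to see which profile the agents are using (it again assumes the default distribution database name; agent type 3 is the Distribution Agent):

    -- list the available Distribution Agent profiles (run on the distributor)
    EXEC msdb.dbo.sp_help_agent_profile @agent_type = 3;

    -- see which profile_id each distribution agent is currently assigned
    SELECT name, profile_id
    FROM distribution.dbo.MSdistribution_agents;

The profile assignment itself can be changed from the Agent Profiles dialog in Replication Monitor.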

Related

SQL Server [Distribution clean up: distribution] job suspended

I have several publications in SQL Server 2016 in my test environment. When my distribution clean up job runs, it runs forever without deleting anything.
I did some digging and discovered that this job is actually blocked by one of the server's repl-logreader sessions. However, when I checked Replication Monitor for all of the server's publications, all of them show a good status without any latency, and their undistributed command counts are 0.
What else should I look at to solve this issue?
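For context, this is roughly how the blocking shows up (a diagnostic sketch using sys.dm_exec_requests and sys.dm_exec_sessions; it only reports what is blocked and by whom):

    -- sessions that are currently blocked, and which session is blocking them
    SELECT r.session_id,
           r.blocking_session_id,
           s.program_name,      -- program name of the blocking session; replication agents are identifiable here
           r.command,
           r.wait_type,
           r.wait_time
    FROM sys.dm_exec_requests AS r
    JOIN sys.dm_exec_sessions AS s ON s.session_id = r.blocking_session_id
    WHERE r.blocking_session_id <> 0;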

SQL Server job hangs when calling an SSIS package until agent is restarted

I have googled and read many questions and answers, but only one question sounded exactly the same, and it did not have an answer.
The situation:
My group has several SQL Servers that are running SQL Server 2017. They are configured virtually identically.
These servers are build boxes, meaning they pull data from a data warehouse or an extract file, run some ETL processing, and then push to a prod box. SSIS packages are deployed on the box where the DB resides.
Just over a month ago (with no updates having occurred), one of these servers started having an issue where all the jobs that ran an SSIS package would "hang" on the step that ran the package. Any other step runs fine. But a job step that runs a package (all jobs do this) will not even start the package. The package shows no indication in the SSIS executions that anything has even tried to start it.
If the user executes the deployed package it will run successfully.
The only thing that will "fix" the issue is restarting the agent service.
I created a simple job to run a simple package every 5 minutes. It had been running for about a week; the last time it ran was 4/11/2021 at 2:40 am, and the 2:45 run hung. I could find nothing in the event logs from that time. The server was rebooted as part of a normal scheduled process at 3:15 and was online by 3:25, because that is the next time the job tried to run, and it again just hung. So even a server reboot did not fix the issue.
I am at my wits' end. Since there is no error (the job hangs and the package does not even start) and there is no logging I can find that shows any issues, I am at a loss as to what might cause this.
Thanks in advance.
Take a look at the SSISDB catalog database on each of the servers involved. Has it grown so large that the history needs to be cleared down or the retention settings changed? How big are the transaction logs for those databases?
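A rough sketch of those checks (the 30-day retention value is only an example; adjust for your environment):

    USE SSISDB;

    -- current retention / cleanup settings of the SSIS catalog
    SELECT property_name, property_value
    FROM catalog.catalog_properties
    WHERE property_name IN (N'RETENTION_WINDOW', N'OPERATION_CLEANUP_ENABLED');

    -- example: shrink the retention window so the maintenance job purges more history
    EXEC catalog.set_catalog_property
         @property_name  = N'RETENTION_WINDOW',
         @property_value = N'30';

    -- log usage of all databases on the instance, including SSISDB
    DBCC SQLPERF (LOGSPACE);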

Issues with LogShipping in SQL Server 2008 R2 SP3

I've got two SQL instances on two separate servers.
The first one has two databases, call them A and B, that I'm log shipping to the secondary server. For database A it works like a charm; however, when log shipping database B we encounter errors and the database enters suspect mode, and I haven't been able to figure out why.
The secondary server used to be the main one, and it had log shipping to a third server, which stopped working. When the secondary server was our main one, we never encountered log shipping issues; both databases A and B shipped logs without problems for a long period of time.
We installed the same SQL Server version on the new server as we had on the previous one and reconfigured the SQL Agent jobs to get it up and running. Database A has encountered no errors so far; however, we can't seem to keep the second database running for longer than a few hours.
All the research I've done has led nowhere; there are a few things I've seen suggested online, and we've run our tests with no success so far.
To point out: database A is much larger than B, yet it has had no issues whatsoever.
We are using SQL Server 2008 R2 with Service Pack 3.
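For reference, a rough sketch of the kind of checks we've been running on the secondary (it assumes the standard log shipping tables in msdb):

    -- recent errors recorded by the log shipping copy/restore agents on the secondary
    SELECT TOP (20) database_name, log_time, source, message
    FROM msdb.dbo.log_shipping_monitor_error_detail
    ORDER BY log_time DESC;

    -- pages SQL Server has flagged as damaged; these can push a database into SUSPECT
    SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
    FROM msdb.dbo.suspect_pages;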
Any help from anyone who might have encountered this in the past would be very much appreciated.
Thanks

SQL server temporary connection failure from particular clients

We have a client deployment of our software that is showing intermittent SQL server connection failures, and we are struggling to understand them.
Our system consists of a SQL Server DB (2012) and 14 identical engines, each installed on a Windows 2012 VM. Each of these was created from the same template so they should be identical. The engines consist of a Windows service that connects to the DB on startup by reading a single row from a table. If the connection fails they will wait a few seconds and try again, until they get a connection.
In this particular case, the VMs were all rebooted due to a Windows Update. (The SQL server had the update/reboot about 12 hours before). They came online within a few minutes of each other. 12 of the engines started up without any problem. Two of them, however, failed to connect to the DB with:
"The underlying provider failed on Open."
Those two engines then started to poll, and continued to get this error for many hours. The rest of the engines had started up and were fine. We have a broker service too that was accessing the DB throughout and showed no connection issues.
When the client noticed this issue, they restarted the engine services on the two problem VMs, and the two engines connected to the DB just fine.
We are trying to understand what could have happened here. I guess my main questions are:
What could be an explanation of why 12 connections succeed and two fail? There's absolutely no difference as far as we know between the engines. The query itself is very simple.
Why did the connection continue to fail for those two engines until the service was restarted? This suggests to me that there is some process-level failed state that is only cleared when restarting the services. I've looked at the code to see if it was reusing the connections. It uses Entity Framework to read the single table row, and we create a fresh DbContext each time. I don't understand how this could go wrong.
We noted that there was a DBCC CHECKDB operation running on the DB around the time the services were coming up, and we wondered if this could be related to the issue. However, the client says that this runs every night and hasn't caused problems in the past. And it wouldn't explain why the two engines never recovered on their own.
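One check that might narrow this down is reading the SQL Server error log on the DB server for that window; a minimal sketch (the search strings are only examples):

    -- search the current SQL Server error log for login failures during the outage window
    EXEC sp_readerrorlog 0, 1, N'Login failed';

    -- and for anything the engine logged about the CHECKDB run that night
    EXEC sp_readerrorlog 0, 1, N'CHECKDB';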
Thanks in advance for any help.

SQL Server 2008 R2 Job Launched Step 1 hundreds of times

I have an ETL SSIS package that is scheduled via job to run nightly at 7pm. It is the only step in the job, and the failure action is "quit the job reporting failure". The server is Windows Server 2008 R2, and the SQL Server version is 2008 R2. There is also an instance of SQL Server 2012 installed on this server, but the services are not started for that instance.
I've made no changes to the job, package, or server, and tonight it behaved strangely. When I look at the history of the job and expand tonight's run, it shows step 1 starting over 400 times, all at exactly 7 PM. It looks like it just kept launching it until the transaction log filled the entire drive and had no more space to grow, then exited the job reporting failure. I shrank the transaction log by setting the recovery model to simple and running DBCC SHRINKFILE. I then restarted all of the SQL services for that instance and re-ran the job. So far, it seems to be running as expected, although I suppose time will tell.
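(For reference, the cleanup amounted to roughly the following; the database and log-file names here are placeholders:)

    -- placeholder names; substitute the actual database and log file logical name
    ALTER DATABASE [ETL_DB] SET RECOVERY SIMPLE;
    DBCC SHRINKFILE (N'ETL_DB_log', 1024);   -- target size in MB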
I did a search of Stack Overflow and have seen nothing like this mentioned. We're actually starting a project to virtualize the box and then upgrade to 2012, so this may end up being one of those oddball things that never happens again, but I thought I'd ask in case anyone has any idea why this might have happened.
Open the job step and go to the Advanced tab. Look at the retry attempts. Could it be that it is set to a big number? That would make the step run many times if it fails.
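A quick way to check that without opening the GUI (the job name is a placeholder):

    -- retry settings for every step of the job;
    -- a non-zero retry_attempts makes a failing step run again
    SELECT j.name AS job_name,
           s.step_id,
           s.step_name,
           s.retry_attempts,
           s.retry_interval      -- minutes between retries
    FROM msdb.dbo.sysjobs     AS j
    JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
    WHERE j.name = N'Nightly ETL';   -- placeholder job name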
