SQL Server Trace Files Filling Up Agent Drive

Background:
SQL Compliance Manager collects trace files on an Agent server for auditing. Once the trace files accumulate on the Agent, the Compliance Manager agent service account moves them to the Collection Server folder, where they are processed and then deleted.
Problem:
More than five times in the last month, the trace files have filled the Agent drive to the point where the traces have to be stopped by running a SQL query to change their status. This has a knock-on effect on the Collection Server: the folder there starts filling up excessively and the Collection Server agent is unable to process the audit trace files. Four of the five times the issue occurred shortly after a SQL failover; however, the last time it occurred there had been no failover. The only thing noticeable in the event logs was that three SQL jobs fired around the time the traces started acting up.
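For reference, this is roughly the kind of query used to stop the traces when the drive fills up; sys.traces and sp_trace_setstatus are the standard server-side trace controls, but the specific trace id belonging to Compliance Manager is environment-specific and the value below is only a placeholder.

-- List server-side traces, their status (1 = running, 0 = stopped) and file settings
SELECT id, status, path, max_size, max_files, start_time, last_event_time
FROM sys.traces
WHERE is_default = 0;   -- exclude the default trace

-- Stop a runaway trace by id (status 0 = stop, 2 = close and delete the definition)
DECLARE @traceid int = 2;   -- hypothetical id taken from the query above
EXEC sp_trace_setstatus @traceid = @traceid, @status = 0;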
Behaviour:
A pattern has been identified in Windows Event Viewer: an execution timeout is logged at or close to the time the trace files start becoming unwieldy.
Error: An error occurred starting traces for instance XXXXXXXXX. Error: Execution Timeout Expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The trace start timeout value can be modified on the Trace Options tab of the Agent Properties dialog in the SQLcompliance Management Console.
However, I do not believe that simply adjusting the timeout settings will stop the traces from behaving this way, as these are the recommended settings and other audited servers use the same settings without showing the problem. The issue only occurs on one box.
Questions:
I want to find out whether anyone else has experienced a similar issue and, if so, whether the environment it happened in was under heavy load. Did reducing the load help, or were other remediation steps needed? Alternatively, does anyone know of a lightweight database auditing tool that doesn't create these issues?
Any help or advice appreciated!

Related

SQL Server Agent randomly stops working but says it's running in Services even though the agent is dead (forced restart fixes it)

I've not seen this scenario in all my web searches.
We have several SQL Server Agent jobs that get kicked off by users through various applications.
Quite randomly, these executions will fail. When I log in and manually kick the jobs off, I'm immediately presented with 'SQL Agent not started' errors.
When I check the agent it says it's running. If I force a restart the problem is fixed.
The random nature of the problem makes testing difficult.
The agent is already set up to 'restart if stopped unexpectedly' and 'restart with delay'.
A recent disaster recovery was performed and SQL Server Agent started successfully and declared itself 'running'. However, once again it required a manual restart to actually work.
Is this a known issue or bug?
How can I mitigate against this when the agent says it's running and all parameters are set correctly and match the 'internet's' suggestions?
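One hedged cross-check is to compare what the Services console shows with what SQL Server itself reports via sys.dm_server_services (available from SQL Server 2008 R2 SP1 onward); capturing status_desc and last_startup_time before and after a forced restart at least shows whether the Agent process is genuinely up.

-- Compare the Agent's reported state and last startup time with what Services shows
SELECT servicename, status_desc, startup_type_desc, last_startup_time, service_account
FROM sys.dm_server_services
WHERE servicename LIKE 'SQL Server Agent%';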

SQL Server Agent Job stops SSIS Step with "unexpected error" and without any error information

I am dealing with this problem on several Windows Server 2019 (Core) machines, each running one SQL Server 2019 CU4 instance.
What we try to do
We are currently building a data warehouse with distributed databases. The individual layers of the DWH are each located on their own database server. The data exchange between the layers/servers takes place via SSIS ETL packages, which use linked servers to reach the other layers and pull data across. Each layer also has its own SSIS service instance that executes the corresponding SSIS packages.
The SSIS packages are called by SQL Server Agent jobs. We have a job that executes the SSIS packages (#1), which in turn calls another job (#2) as its last step; job #2, after a short wait, starts the calling job (#1) again. Thus, controlled by schedules, a loop is created and data is continuously transferred with ETLs.
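For context, the last step of job #2 that re-triggers job #1 is presumably just an msdb call along these lines (the job name and delay are placeholders):

-- Final step of job #2: wait briefly, then start the calling job again
WAITFOR DELAY '00:05:00';                        -- short pause between loop iterations
EXEC msdb.dbo.sp_start_job @job_name = N'DWH Load - Job 1';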
I hope this was not too much unnecessary background.
The error
Basically the job runs and there are numerous successful executions. However, we are observing interruptions of job #1 without any helpful information about the error. The job history log refers to the SSIS log, which in turn only contains an "unexpected termination". In the SSIS log we only see behaviour indicating that the ETL package active at that time stopped after validation. Depending on the log level, nothing is logged at all, not even the execution of the individual packages of the project. The package where this error occurs varies and is not limited to a specific one.
What I have already tried
Re-creating the jobs and SSIS environments by hand (scripted before)
Using the 32-bit runtime
Upgrading the SSIS project/package version to 2019
Increasing the log level to "verbose"
Patching the SQL Server to CU4
Saving SSIS dump files (couldn't find them, or they weren't created)
Searching the Windows and SQL Server log files
Does anyone have suggestions or ideas on how to get more specific error information?
Thank you very much and take care :)
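Since the packages run from the SSIS catalog, one way to dig out more specific error information than the job history shows is to query SSISDB directly; a sketch (the TOP limit, time ordering, and the example execution id are assumptions):

-- Most recent catalog executions and their outcome
SELECT TOP (20) e.execution_id, e.folder_name, e.project_name, e.package_name,
       e.status, e.start_time, e.end_time
FROM SSISDB.catalog.executions AS e
ORDER BY e.execution_id DESC;

-- Error and task-failure messages for a given execution
SELECT em.message_time, em.message_source_name, em.subcomponent_name, em.message
FROM SSISDB.catalog.event_messages AS em
WHERE em.operation_id = 12345               -- hypothetical execution_id from above
  AND em.event_name IN ('OnError', 'OnTaskFailed')
ORDER BY em.message_time;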
UPDATE: We have an error message (OLE DB 0xC0202009 and 0x80004005)!
In order to exclude the use of environments as a cause, I manually set the parameters in the SSIS job step instead of overwriting them by selecting an environment.
Long story short: today it turned out that the parameter for an OLE DB connection string is not passed correctly.
The following is specified as a parameter in the job step:
However, the following connection string is specified in the context of the error message:
Please note that some arguments are added twice to the parameter (red).
What could have caused that?
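To rule out the job definition itself, the raw command the Agent stores for the SSIS step can be inspected in msdb and compared against the connection string shown in the error message; a rough check (the job name is a placeholder):

-- Show the exact command, including /Par arguments, stored in the SSIS job step
SELECT j.name AS job_name, s.step_id, s.step_name, s.command
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobsteps AS s ON s.job_id = j.job_id
WHERE j.name = N'DWH Load - Job 1';          -- hypothetical job name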

Couldn't connect to database when using Top Resource Consumers QueryStore Report

We recently upgraded our SQL Server to 2016, and I turned on Query Store to do the analysis it provides. I'm encountering a problem where, even if the time period of the report is 'Last hour', it generates a message that says "Couldn't connect to database", even when running it on the database server itself. Sometimes if I keep refreshing the report it will eventually display some data, but it's intermittent at best. I'm running SSMS 17.5 against a SQL Server 2016 server.
We are having a somewhat similar issue with another program that connects to the database, where it sometimes cannot connect; but every time I run my queries in SSMS, run reports in SSRS, or even use Activity Monitor, I never see any connection drops, so I'm not sure whether it is related.
Thank you in advance for any help!
I find it works fine with the statistic set to Avg, StdDev, or Total. Max and Min give the error.
I found this happens when the query store runs out of space and gets into cleanup mode.
In Database Properties in SSMS, try adjusting the Query Store settings: how many days it retains query statistics and whether it goes into size-based cleanup mode. More info on how to keep it tuned: https://learn.microsoft.com/en-us/sql/relational-databases/performance/best-practice-with-the-query-store?view=sql-server-ver15#Configure
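To see whether the Query Store has in fact hit its size limit and gone read-only, and to loosen the retention settings in T-SQL rather than the GUI, something along these lines can be used (the database name and size/retention values are only examples):

-- Check current Query Store state and space usage in the target database
SELECT actual_state_desc, readonly_reason,
       current_storage_size_mb, max_storage_size_mb,
       stale_query_threshold_days, size_based_cleanup_mode_desc
FROM sys.database_query_store_options;

-- Raise the size limit, shorten retention, and enable size-based cleanup
ALTER DATABASE [YourDatabase]
SET QUERY_STORE (OPERATION_MODE = READ_WRITE,
                 MAX_STORAGE_SIZE_MB = 1024,
                 CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30),
                 SIZE_BASED_CLEANUP_MODE = AUTO);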

SQL Server Reporting Services (SSRS) 2005 operation timeout issue

I know similar questions have been asked before...
I am using SQL Server 2005, with SSRS 2005 installed on the same box (i.e. production DB, report DB/tempdb, database engine, and SSRS all on the same box).
We have about 200 reports deployed on the box.
SSRS/DB is running on a W2k3 64-bit VM.
Now the problem...
Occasionally, almost on a daily basis, our users get the 'operation timeout' error (error in XML document....). At first I thought it was a report size problem, but then when I try the Report Manager URL (http://<>/reports), nothing appears in the browser. The only thing I can do is recycle the report server IIS pool, after which it works again. Every time the 'operation timeout' happens, the Report Manager URL stops working, and I can't find any logs in IIS to indicate there is a problem.
I researched on the net and found that some people put a dummy report into a SQL Server Agent job that runs every 10 minutes from 9 to 5 to 'warm up' SSRS. The dummy report makes a small connection to the DB, selecting one row from a very small table. The operation timeout problem seems to have disappeared 95% of the time, but it still happens. Strangely enough, when the operation timeout problem occurs, I notice the dummy report job has also stopped working. In that case, I have to recycle the IIS pool and start the SQL Server Agent job again, and then SSRS works again (until the same problem happens next time).
The error I got from the SQL server job is:
System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host
However, I am totally confused by how the IIS issue on the report server somehow affects the SQL Server Agent job. Maybe I am on the wrong track, but that's bizarre.
My observation so far is that if the Report Manager URL (http://<>/reports) takes forever to appear, it is a bad sign that something has gone terribly wrong in SSRS.
I have also added a new task which calls the SSRS Report Manager URL (http://<>/reports) using PowerShell in order to 'warm up' IIS, but it does not seem to make much difference.
Can someone point me to the right direction? Thanks. WM
In the past, after much research, I've found memory allocation for SSRS to be the root of many issues. You can try this.
Add the following to the <Service> node in the rsreportserver.config file:
<WorkingSetMaximum>4000000</WorkingSetMaximum>
The file is typically in c:\program files\Microsoft SQL Server\MSRS11.iMIS\Reporting Services\ReportServer
This sets the maximum memory available to the report server process; the minimum (WorkingSetMinimum) then defaults to 60% of the maximum.
https://msdn.microsoft.com/en-us/library/ms159206(v=sql.110).aspx
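Separately, to confirm whether the timeouts coincide with long-running or failing report executions, the report server catalog's execution log can be checked; a sketch, assuming the default ReportServer database name and the SSRS 2005 ExecutionLog columns:

-- Recent report executions: component durations (ms) and status, slowest first
SELECT TOP (50) TimeStart, TimeEnd, TimeDataRetrieval, TimeProcessing, TimeRendering, Status
FROM ReportServer.dbo.ExecutionLog
WHERE TimeStart > DATEADD(day, -1, GETDATE())
ORDER BY TimeDataRetrieval + TimeProcessing + TimeRendering DESC;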

Replication: Cannot execute 'sp_replcmds' on <ServerName>, all simple solutions already explored

We ran out of space on our Production server, and during this time we started getting "Cannot execute 'sp_replcmds' on <ServerName>" on replication. The Distributor is also the Publisher.
After fixing the space issue, this is the only error I'm still getting on my replication setup.
We have five databases set up for replication. The four small databases work with no error messages, except that the Last Synchronization Status says: "The process could not connect to Distributor".
The one large database gets the error in the subject and also cannot connect to the Distributor. The error code is MSSQL_REPL22037.
I checked the database owner and it is set correctly. I stopped and started the Log Reader Agents more times than I can count. I also restarted the SQL Server Agent processes on the Subscriber server.
I solved this one myself, after trying all the other suggestions.
It was definitely the BatchSize and QueryTimeOut properties.
In order to change this:
Launch Replication Monitor.
Expand to the Publication in question.
Go to Agents Tab.
Right Click on Log Reader Agent > Agent Profile.
Create a New Agent Profile with the new parameters you need.
Set the new profile to 'Use for this Agent'.
Restart the Log Reader Agent and just wait.
Rinse and repeat until you find the right values.
I changed the QueryTimeOut from 1800 to 2400 and the BatchSize from 500 to 100.
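For anyone who prefers to check these values in T-SQL rather than Replication Monitor, the replication agent profile procedures in msdb on the Distributor can be used; a sketch (the profile id below is only a placeholder for whatever the first call returns for the Log Reader profile in use):

-- On the Distributor: list Log Reader Agent profiles (agent_type 2 = Log Reader)
EXEC msdb.dbo.sp_help_agent_profile @agent_type = 2;

-- Show the parameters (e.g. ReadBatchSize, QueryTimeout) of a specific profile
EXEC msdb.dbo.sp_help_agent_parameter @profile_id = 2;   -- hypothetical id from above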
