NServiceBus Expired Message Purge Fails, Unable to Remove Messages, Logs WARN - sql-server

We have some NServiceBus Handlers (6.4.0) using SQL Server Transport (3.1.2) that run fine, but their expired message purge cycle always fails to remove any rows, and a WARN is always logged about this. Contrary to the message, I don't see any messages accumulating in the endpoint table. The image below is of the handler running as a console app, logging the WARN.
[Screenshot: WARN logged when running as a console app]
Environment Oddities: Our transport (and user data) database is in Compatibility Mode 80 (i.e. SQL Server 2000 mode) even though the server instance is 2008 R2. We had some trouble with the transport tables because the server complained that ARITHABORT had to be ON to support the index used on those tables, but our corporate software demands it be OFF by default. To get around changing it globally, in EndpointConfig we use 'UseCustomConnectionFactory()' to supply a function which creates new SqlConnections and, after creation, runs SET ARITHABORT ON on each connection before returning it for use by the application. That seemed to solve that issue - but now we get the purge failure and WARN. The actual error message mentions "timeout" and 'server not responding' - but the database is continually available, queryable, and in use while this is occurring. Additionally, this occurs when volume is very low - as low as 2 or 3 messages per minute.
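For illustration, the wiring looks roughly like this - a minimal sketch, not our exact code, assuming the UseCustomSqlConnectionFactory hook as documented for SQL Server Transport 3.x, using System.Data.SqlClient, with connectionString read from our config:

var transport = endpointConfiguration.UseTransport<SqlServerTransport>();
transport.UseCustomSqlConnectionFactory(
    async () =>
    {
        var connection = new SqlConnection(connectionString);
        await connection.OpenAsync();
        // Satisfy the index's SET option requirement on this connection only,
        // without touching the server-wide default.
        using (var command = connection.CreateCommand())
        {
            command.CommandText = "SET ARITHABORT ON";
            await command.ExecuteNonQueryAsync();
        }
        return connection;
    });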
Any ideas on what might be wrong, how we might debug further, or how to resolve the issue are very much appreciated.

Related

How to resolve DB connection invalidated warning in Airflow Scheduler?

I am upgrading our Airflow instance from 1.9 to 1.10.3, and now whenever the scheduler runs I get a warning that the database connection has been invalidated and that it's trying to reconnect. A bunch of these errors show up in a row. The console also indicates that tasks are being scheduled, but if I check the database, nothing is ever being written.
The following warning shows up where it didn't before:
[2019-05-21 17:29:26,017] {sqlalchemy.py:81} WARNING - DB connection invalidated. Reconnecting...
Eventually, I'll also get this error:
FATAL: remaining connection slots are reserved for non-replication superuser connections
I've tried increasing the SQLAlchemy pool size setting in airflow.cfg, but that had no effect:
# The SqlAlchemy pool size is the maximum number of database connections in the pool.
sql_alchemy_pool_size = 10
I'm using CeleryExecutor and I'm thinking that maybe the number of workers is overloading the database connections.
I run three commands, airflow webserver, airflow scheduler, and airflow worker, so there should only be one worker and I don't see why that would overload the database.
How do I resolve the database connection errors? Is there a setting to increase the number of database connections, if so where is it? Do I need to handle the workers differently?
Update:
Even with no workers running - starting the webserver and scheduler fresh - the DB connection warning starts to appear as soon as the scheduler fills up the Airflow pools.
Update 2:
I found the following issue in the Airflow Jira: https://issues.apache.org/jira/browse/AIRFLOW-4567
There is some activity on it, with others saying they see the same issue. It is unclear whether this directly causes the crashes that some people are seeing, or whether it is just an annoying but cosmetic log message. As of yet there is no resolution to this problem.
This has been resolved in the latest version of Airflow, 1.10.4.
I believe it was fixed by AIRFLOW-4332, updating SQLAlchemy to a newer version.
Pull request

Email Router - when updating Mailbox's email address returns SQL timeout error

I have an on-premises CRM 2016 instance and I can't receive any incoming emails in it, even though when I run the test access it says everything is good.
First, I'm unable to change a queue record's email address, because I keep getting a SQL timeout error (no matter how much you increase the timeout, it never changes); if I try to change any other field, it works and saves - but not the email field, of course.
The same happens with Mailbox records: when I try to change the email address, it returns a SQL timeout error.
So what I did was change these email addresses with SQL queries, but after that, emails still won't get created inside CRM.
The event viewer shows the following warning:
35241 - The recipients for the email message with subject "[x]" in mailbox [email address] did not match any known records.
I'm running out of options here. When I run the diagnostics tool on my organization, its performance is good, but there must be something obstructing the communication with SQL? Any clues?
SQL timeout error:
Unhandled Exception: System.ServiceModel.FaultException`1[[Microsoft.Xrm.Sdk.OrganizationServiceFault, Microsoft.Xrm.Sdk, Version=8.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]: SQL timeout expired.Detail:
-2147204783
SQL timeout expired.
2018-10-10T14:14:15.5749939Z
I got the answer from the Microsoft Community Forums, thanks to Radu Chiribelea:
It's not enough to change the email address in SQL in a record's base table for it to be usable for email tracking. There are other references as well - for example, the EmailSearchBase. This is why you need to let the platform handle your changes.
Your biggest issue here is the SQL timeout, and that is what you need to address. Since this occurs at a Create / Update, I suspect there might be a deadlock somewhere. Do you have any plug-ins or workflows triggered at the time you create / update? If you disable those, do you still see the issue?
Can you enable a CRM Platform trace at a Verbose Level while reproducing the issue? This would give you a better overview of the actual timeout and you can then start from there to tackle it.
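For reference, "letting the platform handle your changes" means updating through the SDK rather than SQL - a minimal sketch, where service is an IOrganizationService from the CRM SDK and queueId and the address are made-up values:

// Update the queue's email address through the platform so that all
// dependent references (e.g. EmailSearchBase) are maintained for you.
var queue = new Entity("queue", queueId)
{
    ["emailaddress"] = "support@example.com"
};
service.Update(queue);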

Merge replication uninitialized subscription is expired or does not exist

I am trying to set up merge replication using web synchronization between a publishing SQL Server 2012 Standard instance and a subscribing SQL Server 2012 Express instance. After following the instructions provided on Technet, I am stuck on this:
Source: Merge Process(Web Sync Server)
Number: -2147200985
Message: The subscription to publication 'MyMergePublication' has expired or does not exist.
I already verified that the SSL certificates are good and that I can browse to the publishing machine's URL https://mycomputer/replisapi.dll and get the expected output. I already verified that the snapshot was set up, and I took a giant hammer and used an administrator account to run the pool identity - which is really bad security-wise, but I wanted to validate that it was not security that was tripping me up.
To further the mystery, when I try and fail to sync, the publisher acknowledges that a new subscriber has been registered, but it cannot get the snapshot at all, and thus the subscriber database is still empty.
On the replication monitor there is no failed-synchronization history and there are no errors; all it has to say is that the subscriber is uninitialized, and no more.
Turning up the verbosity of the merge agent, I saw some SQL being executed; I tried replicating that SQL myself and found that this call was failing with the same error:
{call sys.sp_MSgetreplicainfo(?,?,?,?,?,?,?,90)}
I called it with only the 3 mandatory parameters supplied and it would still fail, despite the fact that a prior call to sp_helpmergepublication does return a row for that publication. Oddly, the output of sp_helpmergepublication does not match what I configured for the subscription (e.g. it says the web URL is null, while viewing the properties correctly shows the web URL being set). I'm not sure whether that is significant.
The body of sp_MSgetreplicainfo contains a call to another system sproc that I cannot run for some reason (it says not found), so I'm not sure what is actually going on here.
Any clues would be greatly appreciated.

What causes NHibernate “Internal Connection Fatal” errors?

When NHibernate’s log level is set to “DEBUG”, we start seeing a bunch of “Internal Connection Fatal” errors in our logs. It looks like NHibernate dies about halfway through processing a particular result set. According to the logs, the last column NHibernate reads appears to contain garbage that isn’t in the underlying data.
The issue seems to go away when either:
The log level is set back to “ERROR”.
The view being queried is changed to return less data (either fewer rows OR null or blank values for various columns).
We’re using ASP.NET MVC, IIS7, .NET Framework 4.5, SQL Server 2012, log4net.2.0.2 and NHibernate.3.3.3.4001.
I guess my real concern is that there is some hidden issue with the code that the added strain of logging is bringing to light, but I'm not sure what it could be. I have double-checked the NHibernate mappings and they look good. I've also checked to ensure that I'm disposing of the NHibernate session at the end of each request. I also tried bumping up the command timeout, which didn't seem to make a difference.
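For what it's worth, the disposal pattern in question is roughly the classic per-request binding - a hypothetical sketch, assuming NHibernate's current_session_context_class is set to "web" and SessionFactory is the application's singleton ISessionFactory:

// Global.asax: bind a session at the start of each request,
// unbind and dispose it at the end.
protected void Application_BeginRequest()
{
    var session = SessionFactory.OpenSession();
    NHibernate.Context.CurrentSessionContext.Bind(session);
}

protected void Application_EndRequest()
{
    var session = NHibernate.Context.CurrentSessionContext.Unbind(SessionFactory);
    if (session != null)
    {
        session.Dispose();
    }
}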
If one of the columns is of a non-simple type (binary, text, etc.), NHibernate may be having problems populating that property.
Turns out the connection from our dev app server to our dev database server was wonky.
From the dev app server, open SSMS and try to connect to the dev database server.
Sometimes we get the "internal connection fatal error", sometimes we don't.
The issue is possibly caused by a TCP Chimney Offload / SQL Server incompatibility.
Check the following KB article for possible solutions (the usual workaround is disabling TCP Chimney Offload, e.g. netsh int tcp set global chimney=disabled):
http://support.microsoft.com/kb/942861
For Windows 7/2008 R2:
By default, the TCP Chimney Offload feature is set to Auto. This means that the chimney does not offload all connections. Instead, it selectively offloads the connections that meet the following criteria:
The connection is established through a 10 gigabits per second (Gbps) Ethernet adapter.
The mean round-trip link latency is less than 20 milliseconds.
At least 130 kilobytes (KB) of data were exchanged over the connection.
The last condition is triggered in the middle of the result set, so you see garbage instead of real data.

Error 17886 - The server will drop the connection

We are running a website on a VPS with SQL Server 2008 R2 x64. We are being bombarded with 17886 errors - namely:
The server will drop the connection, because the client driver has sent multiple requests while the session is in single-user mode. This error occurs when a client sends a request to reset the connection while there are batches still running in the session, or when the client sends a request while the session is resetting a connection. Please contact the client driver vendor.
This causes SQL statements to return corrupt results. I have tried pretty much all of the suggestions I have found on the net, including:
with MARS, and without
with pooling, and without
with async=true, and without
We only have one database and it is absolutely multi-user.
Everything has been installed recently, so it is up to date. The errors may be correlated with high CPU (though not exclusively, according to the monitors I have seen), and also with high request rates from search engines. However, high CPU/request rates shouldn't cause SQL connections to reset - at worst we should see high response times or IIS refusing to send responses.
Any suggestions? I am only a developer, not a DBA - do I need a DBA to solve this problem?
Not sure, but some of your queries might be causing blocking or deadlocks on the server.
At the point you detect this error again:
Open Management Studio (on the server; install it if necessary)
Open a new query window
Run sp_who2
Check the BlkBy column, which is short for Blocked By. If there is any data in that column, you have a blocking problem (normally it should be completely empty).
If you do have blocking, then we can continue with the next steps. But right now, please check that.
To fix the error above, "MultipleActiveResultSets=True" needs to be added to the connection string.
via Event ID 17886 MSSQLServer – The server will drop the connection
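If you want to try that, a minimal sketch (using System.Data.SqlClient; the base connection string here is hypothetical):

// Add MARS to an existing connection string.
var builder = new SqlConnectionStringBuilder(
    "Data Source=.;Initial Catalog=MyDb;Integrated Security=True")
{
    MultipleActiveResultSets = true
};
var connectionString = builder.ConnectionString;
// connectionString now includes "MultipleActiveResultSets=True"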
I would create an event log task to email you whenever 17886 is thrown. Then go immediately to the database and execute sp_who2, get the BlkBy spid, and run DBCC INPUTBUFFER on it. Hopefully the EventInfo column will give you something a bit more tangible to go on.
sp_who2                -- find the blocked session and note its BlkBy spid
DBCC INPUTBUFFER(62)   -- 62 = the blocking spid reported by sp_who2
GO
Use an "Instance Per Request" strategy in your DI-instantiation code and your problem will be solved.
Most probably you are using dependency injection. During web development you have to take into account the possibility of concurrent requests; therefore you have to make sure every request gets new instances during DI, otherwise you will run into concurrency issues. Don't be cheap by using ".SingleInstance" for services and contexts.
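For example, with Autofac (whose naming the ".SingleInstance" above suggests), the difference is one registration call - MyDbContext here is a hypothetical per-request dependency:

var builder = new ContainerBuilder();

// Risky in a web app: one instance shared across all concurrent requests.
// builder.RegisterType<MyDbContext>().SingleInstance();

// One fresh instance per web request instead.
builder.RegisterType<MyDbContext>().InstancePerRequest();

var container = builder.Build();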
Enabling MARS will probably decrease the number of errors, but the errors that are encountered will be less clear. Enabling MARS is almost never the solution; do not use it unless you know what you're doing.
