About 1 in every 30 connections fails with Win32Exception: Unknown location error. Azure web app to AWS SQL DB - sql-server

We have a couple of .NET Core 3.0 Web Apps (UK South) that connect to a MS SQL 2016 database which is running on an Amazon Windows Server 2016 Datacenter (EC2 instance). We connect via an Azure Relay/Hybrid Connection which is installed on the SQL Server.
It has been working fine for over a year with no errors, but recently we've started getting the following error, about 1 in every 30 connections:
An unhandled exception occurred while processing the request.
Win32Exception: An existing connection was forcibly closed by the remote host.
Unknown location
SqlException: A connection was successfully established with the server, but
then an error occurred during the pre-login handshake. (provider: TCP Provider,
error: 0 - An existing connection was forcibly closed by the remote host.)
If you try again it usually works.
After reading a lot of posts on this, I added transient-error resilience to the DB connection using EnableRetryOnFailure().
I also tried adding Trusted_Connection=False to the connection string.
After this you could see the connection retrying multiple times until it worked, sometimes taking 20 seconds or more. Still, in maybe 1 in 100 connections it eventually fails with the same error.
We also looked at the TLS_DHE bug https://learn.microsoft.com/en-us/troubleshoot/windows-server/identity/apps-forcibly-closed-tls-connection-errors but the TLS_DHE ciphers are not installed on the server at all.
There's nothing in the event logs on the Windows server, or in the database logs at the time of the error.
Recent changes in the infrastructure: Panda antivirus, moved web apps to a different Azure region.
I've been reading posts on this for days now, mostly really old and slightly different. I'm looking for any ideas of things to try to pinpoint the error. Thanks.
edit: I found some event logs in Microsoft/ServiceBus/Client
HybridConnectionManager Trace: Microsoft.Azure.Relay.RelayException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.WebSockets.WebSocketException: An internal WebSocket error occurred. Please see the innerException, if present, for more details. ---> System.IO.IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. ---> System.Net.Sockets.SocketException: An existing connection was forcibly closed by the remote host
at System.Net.Sockets.Socket.EndReceive(IAsyncResult asyncResult)
at System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult)
--- End of inner exception stack trace ---

Well, this took three months to resolve and it involved our network support team, AWS support, and Azure support.
I've come back three times to edit this answer. The problem returned on a different server, so we tried the fixes that had worked on the first one, and they didn't work!
In Azure Relay/Hybrid Connections, under the connection in question, we saw there were TWO listeners when there should only be one. Each Hybrid Connection Manager you install and connect shows up there as a listener.
So where was the second listener? Nowhere. It seemed to be a hanging orphan link from a previously deleted connection.
The only way to delete the phantom listener was to:
1. Uninstall HCM on the database server.
2. Remove the connection from all Azure apps using it.
3. Delete the hybrid connection completely in Azure.
4. Recreate the connection in Azure afresh.
5. Reconnect the apps.
6. Reinstall HCM on the database server.
7. Connect HCM to the new hybrid connection.
After this, Azure showed one listener under the connection, and things worked immediately.
When you have two listeners the data is load balanced between them, so in my case half the time the data was being routed to a non-existent listener and failing. This is why no logs appeared on the database server - it wasn't getting there at all!
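As a rough illustration of why retries only masked the problem rather than fixing it: if each attempt is routed independently and some share of attempts lands on the dead listener, an operation that is allowed n retries still fails with probability share^(n+1). The sketch below is just that arithmetic. The class and method names are made up, and the independence of attempts is an assumption, not something we measured.

```csharp
using System;

public static class ListenerOdds
{
    // Probability that one operation still fails after the first attempt plus
    // `retries` further attempts, ASSUMING each attempt independently lands on
    // the dead listener with probability `deadShare`.
    public static double FinalFailureProbability(double deadShare, int retries)
    {
        if (deadShare < 0.0 || deadShare > 1.0)
            throw new ArgumentOutOfRangeException(nameof(deadShare));
        if (retries < 0)
            throw new ArgumentOutOfRangeException(nameof(retries));

        return Math.Pow(deadShare, retries + 1);
    }
}
```

For example, ListenerOdds.FinalFailureProbability(0.5, 6) is 0.5^7 = 0.0078125, roughly 1 in 128. Six retries is, as far as I know, the EF Core default for EnableRetryOnFailure(), so even with retries an even split between a live and a dead listener would keep producing occasional hard failures.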

Related

Azure Hybrid Connection SQL Connection stops working. Restart of Web App helps. Problem with EnableRetryOnFailure?

I have an Azure time-stamping web app connecting to about 50 different customers' on-premises SQL databases using Hybrid Connections. The web app is an ASP.NET Core 2.2 based C# app using EF.
The app has worked extremely well for over 2 years without any errors, but in the past week the SQL connections have stopped working on two separate occasions. They start working again immediately after a restart of the web app. This is of course extremely bad for my customers, who need the service working 24/7.
The Azure failure messages indicate a problem with opening a SQL connection, which builds up over time and is reset by a web app restart.
When studying the SQL failures in Azure, all of them give the same error number, 10013, when trying to open the connection:
System.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections.
(provider: TCP Provider, error: 0 - An attempt was made to access a socket in a way forbidden by its access permissions. Error Number:10013,State:0,Class:20
On closer examination, it turns out that the first failure starts with a run of error-64 failures as below (5-8 of them), after which all subsequent errors are of type 10013 for all connections.
System.Data.SqlClient.SqlException (0x80131904): A connection was successfully established with the server, but then an error occurred during the pre-login handshake.
(provider: TCP Provider, error: 0 - The specified network name is no longer available.
Error Number:64,State:0,Class:20
My thought is: could this be caused by using EnableRetryOnFailure as a SQL option, so that a single badly working connection somehow congests all attempts to open SQL connections?
Code used in the web app:
DbContextOptions dbConnOptions = SqlServerDbContextOptionsExtensions
    .UseSqlServer(new DbContextOptionsBuilder(), dbConnectionString,
        sqlServerOptionsAction: sqlOptions =>
        {
            sqlOptions.EnableRetryOnFailure(
                maxRetryCount: 10,
                maxRetryDelay: TimeSpan.FromSeconds(30),
                errorNumbersToAdd: null);
        }).Options;
Could this be the reason for my SQL connection problems, and would it be better to just remove the EnableRetryOnFailure code?
Or could there be a completely different solution to the problem?
Bengt Bredenberg,
Premisol Oy, Helsinki, Finland
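To put a rough number on how long one operation can sit in retries with the settings above, here is a sketch of a capped exponential backoff schedule. It assumes the delay before retry k is about (2^k - 1) seconds, capped at maxRetryDelay, and it ignores random jitter and the time taken by each connection attempt. That model is an approximation of how EF Core's retrying strategy is understood to behave, not its actual code, and BackoffSketch is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BackoffSketch
{
    // Delay before each retry under the ASSUMED model:
    // delay_k = min((2^k - 1) seconds, maxRetryDelay), for k = 0 .. maxRetryCount - 1.
    public static List<TimeSpan> Delays(int maxRetryCount, TimeSpan maxRetryDelay)
    {
        var delays = new List<TimeSpan>();
        for (int k = 0; k < maxRetryCount; k++)
        {
            double seconds = Math.Pow(2, k) - 1;
            delays.Add(TimeSpan.FromSeconds(Math.Min(seconds, maxRetryDelay.TotalSeconds)));
        }
        return delays;
    }

    // Total time spent waiting between attempts (excludes the attempts themselves).
    public static TimeSpan TotalWait(int maxRetryCount, TimeSpan maxRetryDelay)
    {
        double totalSeconds = Delays(maxRetryCount, maxRetryDelay).Sum(d => d.TotalSeconds);
        return TimeSpan.FromSeconds(totalSeconds);
    }
}
```

Under that model, BackoffSketch.TotalWait(10, TimeSpan.FromSeconds(30)) comes to 176 seconds of waiting for a single failing operation, on top of each attempt's own connect timeout. Whether requests held that long are what leads to the 10013 errors is not established here; it only shows that the retry settings can keep a failing operation alive for minutes.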

Azure app service with hybrid connection can't access on prem SQL Server

I have an app service with a hybrid connection enabled (on a VM in the same network as the SQL Server) so that I can access an on-prem SQL Server, which I don't own. However, the connectivity has been pretty unstable.
I am able to reach the SQL Server on maybe 5% of the tries; mostly I just get this error:
One or more errors occurred. (A connection was successfully established with the server, but then an error occurred during the pre-login handshake. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.
I'm able to log in through SSMS on the VM. The connection string should be all right, since I can connect locally (the local network is the same network as the SQL Server).
Named instances use dynamic ports and UDP, which are not supported by Hybrid Connections. I suggest using a static port, as Nick mentioned. Please refer to the document "Connect to on-premises SQL Server from a web app in Azure App Service using Hybrid Connections" for more information.
provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.
What is your Windows version? Windows versions that don't contain the leading-zero fixes for TLS_DHE can produce this error message. You can try updating Windows or disabling the TLS_DHE ciphers to solve this issue. Please refer to this document.
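For the static-port suggestion above, the connection string would name the port explicitly instead of an instance name. A minimal sketch follows; the helper name and every value passed to it are placeholders, and the host and port have to match the endpoint configured on the Hybrid Connection.

```csharp
using System;

public static class ConnectionStrings
{
    // Builds "Server=tcp:<host>,<port>;..." so the client connects straight to a
    // fixed TCP port and never needs SQL Server Browser (UDP 1434) to resolve
    // an instance name to a dynamic port.
    public static string ForStaticPort(string host, int port, string database, string user, string password)
    {
        return $"Server=tcp:{host},{port};Database={database};User Id={user};Password={password};";
    }
}
```

For example, ConnectionStrings.ForStaticPort("sqlhost", 1433, "MyDb", "appuser", "secret") gives Server=tcp:sqlhost,1433;Database=MyDb;User Id=appuser;Password=secret; (all of those values are hypothetical).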

Datasource verification problems after Windows updates

Yesterday windows updates were installed on my laptop, and afterwards many features of ColdFusion were out of configuration.
I am using ColdFusion 2016 and SQL Server 2016 RC.
I fixed a number of issues (see below) but still get the message
Connection verification failed for data source: MT_EL
java.sql.SQLNonTransientConnectionException: [Macromedia][SQLServer
JDBC Driver]Error establishing socket to host and port: 8500:1433.
Reason: Network is unreachable: connect The root cause was that:
java.sql.SQLNonTransientConnectionException: [Macromedia][SQLServer
JDBC Driver]Error establishing socket to host and port: 8500:1433.
Reason: Network is unreachable: connect.
The DSNs had been verifying for at least a year before the problems occurred.
So far I have done the following:
Both SQL Server and CF Server had to be started again. SQL Server was not a problem but the CF Server would not start. I went to the jvm.config file and reduced the -Xms setting. This did not solve anything, so I looked at the logs. From the logs it was apparent that the neo-security.xml file was corrupted, and upon checking I saw that neo-security.xml was now empty. neo-datasource, neo-drivers and one or two other files were also empty. The back-ups of these files were also empty, but I found some old versions in another place, and copied them over. Now I was able to start the CF Server and get into the CF Administrator, but had to set up user names/passwords and also DSNs again.
SQL Server Configuration Manager had been moved to a different folder, but I found it and soon saw an error message saying that SQL Server Configuration Manager could not connect to the WMI provider. I fixed this by opening a command prompt in administrator mode and typing mofcomp "%programfiles(x86)%\Microsoft SQL Server\13\Shared\sqlmgmproviderxpsp2up.mof".
Now I could get into SQL Server Configuration Manager, but for some reason it is listed twice. The malfunctioning one still says it cannot connect to the WMI provider, but expanding the functioning one, I found that TCP/IP is enabled and the default port is 1433.
I checked the firewall and could not see any issues there.
SQL permissions + log in/password credentials are the same as before, when there were no DSN verification problems.
I have tried ports 8501 and 8502, but the above error persists.
I have checked the SQL Server logs. It is apparent that a number of errors occurred yesterday and certain features were disabled. However it is evident that these issues have now been resolved, and the most recent messages are of informational type and state that no user action is necessary.
Anyone any ideas? Thank you in advance for any comments/assistance.

An existing connection was forcibly closed by the remote host - Intermittent

I have a SQL Server 2014 hosted on Windows server 2012.
I also have many Windows services developed in C# that run on Windows Server 2012.
My services have different responsibilities, so they connect to different databases on the server mentioned above.
Sometimes, in a really intermittent manner, one of the services gets the following SqlException while the others keep working fine:
Message: The client was unable to establish a connection because of an error during connection initialization process before login. Possible causes include the following: the client tried to connect to an unsupported version of SQL Server; the server was too busy to accept new connections; or there was a resource limitation (insufficient memory or maximum allowed connections) on the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
I googled for some troubleshooting info with no luck...
What seems strange is that two services work on the same database on the same server, but only one of them gets the error...

IIS + Kerberos + SQL Server + EF Initial connection failure

I have a web server on my domain that I'm trying to use Kerberos delegation to allow access to my SQL Server. They are all Server 2008 R2 servers with IIS 7.5 and SQL 2008 R2 (the DC is also Server 2008 R2).
Everything is working, in that I see transactions being executed on my SQL Server under the user's account. However, the first time I access the site after an extended period of time (30 mins or so) I get the following error thrown by my EF DataContext object:
Exception: The underlying provider failed on Open
at System.Data.EntityClient.EntityConnection.OptenStoreConnectionIf...
Inner Exception: A network-related or instance-specific error occurred while
establishing a connection to SQL Server. The server was not found or was not
accessible. Verify that the instance name is correct and that SQL Server is
configured to allow remote connections. (provider: Named Pipes Provider,
error: 40 - Could not open a connection to SQL Server)
Inner Inner Exception: The system cannot find the file specified
The error page takes ~20 to 30 seconds to be served. After receiving this error, if I hit refresh in my browser, I get the page with all of the data almost instantly (around 200ms)
What would be causing this initial connection to fail, but all subsequent connections to succeed?
Misc information:
EF 6.0
IIS 7.5, Windows Auth & ASP.NET Impersonation enabled, Extended Protection Off, Kernel-mode auth Off, Providers - Negotiate:Kerberos
AppPool uses service account (all SPNs are registered to that account)
If there is any more information that you need, let me know and I'll update this list!
UPDATE:
After doing several network traces, I'm seeing the following pattern:
HTTP Request 1
6 frames of KerberosV5 traffic
HTTP Response: No SQL Data
HTTP Request 2
2 frames of KerberosV5 traffic
TDS Prelogin
TDS Response
2 more frames KerberosV5 traffic (TGS MSSQLSvc request and response)
6 frames of TDS Traffic (SQL Data)
HTTP Response: Success!!
I'm thinking this is a Kerberos issue...
I can't really tell what is causing your issue, but here is a tip on how you can deal with it, just in case you don't manage to find the cause:
EF CodePlex Link on Connection Resiliency
MSDN article on Connection Resiliency
This is a feature introduced with Entity Framework 6.x. By default, when EF encounters an issue like the one you've described, it throws an exception, and if you want a retry you have to write quite messy code and duplicate it everywhere.
With Connection Resiliency, you can write a DbExecutionStrategy that suits you best. DbExecutionStrategy has a method you can override to decide whether the operation should be executed again when a specific exception type occurs. To the executing code and the end user, this just looks like a slight delay in execution; no error appears.
From my personal experience, what you see now can be caused by many things, including some setting at your hosting provider (if you're not hosting on premises). I'd look into the SQL logs or Event Viewer to see whether SQL Server is for some reason going into a state where it is unavailable.
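A minimal sketch of what such a strategy could look like in EF 6. The class names, the retry count and delay, and the error numbers in the list are illustrative assumptions (check which SqlException.Number values you actually see in your logs), and it needs the EntityFramework 6 package referenced.

```csharp
using System;
using System.Data.Entity;
using System.Data.Entity.Infrastructure;
using System.Data.SqlClient;
using System.Linq;

// Retries an operation when the failure looks like a connection problem.
public class ConnectRetryStrategy : DbExecutionStrategy
{
    // ASSUMPTION: example error numbers only (-2 is a timeout; 2 and 53 are
    // commonly reported when the server cannot be reached).
    private static readonly int[] RetryableErrors = { -2, 2, 53 };

    // Up to 3 retries with at most 5 seconds between them (arbitrary values).
    public ConnectRetryStrategy()
        : base(3, TimeSpan.FromSeconds(5))
    {
    }

    // Pure helper so the decision can be checked without a database.
    public static bool IsRetryable(int sqlErrorNumber)
    {
        return RetryableErrors.Contains(sqlErrorNumber);
    }

    // Called by EF for each failure; returning true allows another attempt.
    protected override bool ShouldRetryOn(Exception exception)
    {
        var sqlException = exception as SqlException;
        return sqlException != null && IsRetryable(sqlException.Number);
    }
}

// EF 6 picks up a DbConfiguration subclass defined in the same assembly as the context.
public class AppDbConfiguration : DbConfiguration
{
    public AppDbConfiguration()
    {
        SetExecutionStrategy("System.Data.SqlClient", () => new ConnectRetryStrategy());
    }
}
```

With this in place the calling code stays unchanged; a first-request failure like the one in the question would be retried by EF instead of surfacing as an error page, although it would not explain why the first attempt fails.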
