I'm working on a .NET application migration from Oracle to SQL Server database. The application was developed in the 2000s by a third party, so we intend to modify it as little as possible in order to avoid introducing new bugs.
I replaced the Oracle references to SqlClient ones (OracleConnection to SqlConnection, OracleTransaction to SqlTransaction etc.) and everything worked fine. However, I'm having trouble with a logic that tries to reconnect to the DB in case of errors.
If a problem occurs when trying to read/write to the database, method TryReconnect is called. This method checks whether the Oracle exception number is 3114 or 12571; if so, it tries to reopen the connection.
I checked these error codes:
ORA-03114: not connected to ORACLE
ORA-12571: TNS:packet writer failure
I searched for the equivalent error codes for SQL Server but I couldn't find them. I checked the MSSQL and .NET SqlClient documentation but I'm not sure that any of those is equivalent to ORA-3114 and ORA-12571.
Can somebody help me decide which error numbers should be checked in this logic? I thought about checking for code 0 (I saw it happen when I stopped the database to force an error and test this) and -2 (timeout expired), but I'm not really sure about them.
The behavior is different, so you can't base your SQL Server retry logic on Oracle semantics. For starters, SqlConnection retries the connection attempt even in the old System.Data.SqlClient library. Its replacement, Microsoft.Data.SqlClient, includes configurable retry logic to handle connections to cloud databases from on-premises applications, e.g. an on-prem application connecting to Azure SQL. This retry logic is on by default in the current RTM version, 3.0.0.
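As a sketch (assuming Microsoft.Data.SqlClient 3.x; the server and database names are placeholders), the built-in retry logic can be tuned through SqlRetryLogicOption instead of hand-rolled error-number checks:

```csharp
using System;
using Microsoft.Data.SqlClient;

class RetryDemo
{
    static void Main()
    {
        // Placeholder connection string - substitute your own server/database.
        var connectionString = "Server=myServer;Database=myDb;Integrated Security=true;";

        // 4 attempts, exponential backoff between 1 and 30 seconds.
        var options = new SqlRetryLogicOption
        {
            NumberOfTries = 4,
            DeltaTime = TimeSpan.FromSeconds(1),
            MaxTimeInterval = TimeSpan.FromSeconds(30)
        };
        var provider = SqlConfigurableRetryFactory.CreateExponentialRetryProvider(options);

        using var connection = new SqlConnection(connectionString);
        connection.RetryLogicProvider = provider; // retries transient failures on Open()
        connection.Open();
    }
}
```

This needs a live SQL Server to actually run; the point is that the transient-error classification lives in the driver, not in your own TryReconnect method.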
You can also look at high-level resiliency libraries like Polly, a very popular resiliency package that implements recovery strategies like retries with backoff, circuit breakers etc. This article describes Cadru.Polly which contains strategies for handling several SQL Server transient faults. You could use this directly or you can handle the transient error numbers described in that article:
Exception Handling Strategy → Errors Handled:
- SqlServerTransientExceptionHandlingStrategy: 40501, 49920, 49919, 49918, 41839, 41325, 41305, 41302, 41301, 40613, 40197, 10936, 10929, 10928, 10060, 10054, 10053, 4221, 4060, 12015, 233, 121, 64, 20
- SqlServerTransientTransactionExceptionHandlingStrategy: 40549, 40550
- SqlServerTimeoutExceptionHandlingStrategy: -2
- NetworkConnectivityExceptionHandlingStrategy: 11001
Polly allows you to combine policies and specify different retry strategies for them, e.g.:
Using a cached response in some cases (lookup data?)
Retrying with backoff (even random delays) in other cases (deadlocks?). Random delays can be very useful if you run into timeouts because too many concurrent operations cause deadlocks or timeouts. Without it, all failing requests would retry at the same time, causing yet another failure
Using a circuit breaker to switch to a different service or server.
You could create an Oracle strategy so you can use Polly throughout your projects and handle all recoverable failures, not just database retries.
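For illustration (the error-number list here is an abridged subset of the table above, and the database call is a placeholder), a Polly retry policy keyed on SqlException.Number might look like:

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using Polly;

class TransientRetry
{
    // Subset of the transient error numbers listed above.
    static readonly HashSet<int> TransientErrors =
        new HashSet<int> { 40501, 40613, 10928, 10929, 10060, 10054, 10053, 4060, 233, 64, 20, -2 };

    static void Main()
    {
        var policy = Policy
            .Handle<SqlException>(ex => TransientErrors.Contains(ex.Number))
            .WaitAndRetry(3, attempt =>
                // Exponential backoff plus random jitter so concurrent
                // failing requests don't all retry at the same instant.
                TimeSpan.FromSeconds(Math.Pow(2, attempt)) +
                TimeSpan.FromMilliseconds(new Random().Next(0, 250)));

        policy.Execute(() =>
        {
            DoDatabaseWork(); // placeholder for the real read/write
        });
    }

    static void DoDatabaseWork() { /* ... */ }
}
```

The jitter term is exactly the "random delays" point above: without it, every failing request retries at the same moment and causes the next failure.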
Related
What does it do? How does it work? Why am I supposed to test the database connection before "borrowing it from the pool"?
I was not able to find any related information as to why I should be using it. Just how to use it. And it baffles me.
Can anyone provide some meaningful definition and possibly resources to find out more?
"test-on-borrow" means that a connection taken from the pool is validated first, usually by running the simple SQL statement defined in "validationQuery". These two properties are commonly used together to make sure that the connections currently in the pool are not stale (no longer actively connected to the DB, e.g. after a DB restart, a timeout enforced by the DB, or any other cause of stale connections). By testing connections on borrow, the application drops the invalid connections and automatically reconnects with new ones, without a manual restart of the app, thus preventing DB connection errors in the app.
You can find more information on jdbc connection pool attributes here:
https://tomcat.apache.org/tomcat-8.0-doc/jdbc-pool.html#Common_Attributes
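As a sketch (the JNDI name, URL, and credentials are placeholders), a Tomcat JDBC pool Resource using these two attributes together might look like:

```xml
<Resource name="jdbc/MyDB"
          auth="Container"
          type="javax.sql.DataSource"
          factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
          url="jdbc:sqlserver://dbhost:1433;databaseName=mydb"
          username="appuser"
          password="secret"
          testOnBorrow="true"
          validationQuery="SELECT 1"
          validationInterval="30000"/>
```

validationInterval throttles how often the validation query actually runs, so test-on-borrow doesn't add a round trip to every single borrow.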
When NHibernate's log level is set to "DEBUG", we start seeing a bunch of "Internal Connection Fatal" errors in our logs. It looks like NHibernate dies about halfway through processing a particular result set. According to the logs, the last column NHibernate reads appears to contain garbage that isn't in the underlying data.
The issue seems to go away when either:
The log level is set back to “ERROR”.
The view being queried is changed to return less data (either fewer rows, or null or blank values for various columns).
We’re using ASP.NET MVC, IIS7, .NET Framework 4.5, SQL Server 2012, log4net.2.0.2 and NHibernate.3.3.3.4001.
I guess my real concern is that there is some hidden issue with the code that the added strain of logging is bringing to light, but I'm not sure what it could be. I have double checked the NHibernate mappings and they look good. I've also checked to ensure that I'm disposing of the NHibernate session at the end of each request. I also tried bumping up the command timeout, which didn't seem to make a difference.
If one of the columns is of a non-simple type (binary, text, etc.), NHibernate may be having problems populating a property.
Turns out the connection from our dev app server to our dev database server was wonky.
From the dev app server, open SSMS and try to connect to the dev database server.
Sometimes we get the "internal connection fatal error", sometimes we don't.
Issue is possibly caused by TCP Chimney Offload/SQL Server incompatibility.
Check the following KB Article for possible solutions:
http://support.microsoft.com/kb/942861
For Windows 7/2008 R2:
By default, the TCP Chimney Offload feature is set to Auto. This means that the chimney does not offload all connections. Instead, it selectively offloads the connections that meet the following criteria:
The connection is established through a 10 gigabits per second (Gbps) Ethernet adapter.
The mean round-trip link latency is less than 20 milliseconds.
At least 130 kilobytes (KB) of data were exchanged over the connection.
The last condition is triggered in the middle of the dataset, so you see garbage instead of real data.
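If the KB article points you at TCP Chimney Offload, it can be checked and disabled globally; this is a sketch of the usual commands, run from an elevated prompt on the server:

```
netsh int tcp show global
netsh int tcp set global chimney=disabled
```

Disabling offload affects all connections on the machine, so coordinate with whoever administers the network stack before changing it in production.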
We are running a website on a VPS server with SQL Server 2008 R2 x64. We are being bombarded with 17886 errors - namely:
The server will drop the connection, because the client driver has sent multiple requests while the session is in single-user mode. This error occurs when a client sends a request to reset the connection while there are batches still running in the session, or when the client sends a request while the session is resetting a connection. Please contact the client driver vendor.
This causes SQL statements to return corrupt results. I have tried pretty much all of the suggestions I have found on the net, including:
with MARS, and without.
with pooling and without
with async=true and without
we only have one database and it is absolutely multi-user.
Everything has been installed recently, so it is up to date. The errors may be correlated with high CPU (though not exclusively, according to the monitors I have seen), and also with high request rates from search engines. However, high CPU or request rates shouldn't cause SQL connections to reset - at worst we should see high response times or IIS refusing to send a response.
Any suggestions? I am only a developer, not a DBA - do I need a DBA to solve this problem?
Not sure, but some of your queries might be causing deadlocks on the server.
At the point you detect this error again
Open Management Studio (on the server, install it if necessary)
Open a new query window
Run sp_who2
Check the BlkBy column, which is short for Blocked By. If there is any data in that column, you have a blocking/deadlock problem (normally the column should be completely empty).
If you have a deadlock then we can continue with next steps. But right now please check that.
To fix the error above, "MultipleActiveResultSets=True" needs to be added to the connection string.
via Event ID 17886 MSSQLServer – The server will drop the connection
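For example (server and database names are placeholders), the connection string would then look like:

```
Data Source=myServer;Initial Catalog=myDb;Integrated Security=True;MultipleActiveResultSets=True
```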
I would create an eventlog task to email you whenever 17886 is thrown. Then go immediately to the db and execute the sp_who2, get the blkby spid and run a dbcc inputbuffer. Hopefully the eventinfo will give you something a bit more tangible to go on.
sp_who2
DBCC INPUTBUFFER(62)
GO
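Alternatively (a sketch; it requires VIEW SERVER STATE permission), the blocking chain and the offending statement can be pulled from the DMVs instead of sp_who2:

```sql
-- Sessions that are currently blocked, who blocks them, and what they run
SELECT r.session_id,
       r.blocking_session_id,
       r.wait_type,
       t.text AS running_sql
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.blocking_session_id <> 0;
```

This gives you the statement text directly, without having to run DBCC INPUTBUFFER per spid.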
Use an "Instance Per Request" strategy in your DI-instantiation code and your problem will be solved.
Most probably you are using dependency injection. During web development you have to take into account the possibility of concurrent requests. Therefore you have to make sure every request gets new instances from the DI container, otherwise you will run into concurrency issues. Don't be cheap by using ".SingleInstance" for services and contexts.
Enabling MARS will probably decrease the number of errors, but the errors that remain will be less clear. Enabling MARS is almost never the solution; do not use it unless you know what you're doing.
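As an illustration (assuming the Autofac container; the service and context type names are placeholders), per-request lifetimes look like this:

```csharp
using Autofac;

public static class DiConfig
{
    public static IContainer Build()
    {
        var builder = new ContainerBuilder();

        // One DbContext / service instance per web request, not a singleton.
        builder.RegisterType<MyDbContext>().InstancePerRequest();
        builder.RegisterType<MyService>().As<IMyService>().InstancePerRequest();

        // NOT .SingleInstance() - a context shared across concurrent
        // requests is exactly what triggers these connection resets.
        return builder.Build();
    }
}
```

Other containers have equivalents (e.g. scoped lifetimes); the key point is one context instance per request.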
I have 3 servers set up for SQL mirroring and automatic failover using a witness server. This works as expected.
Now my application that connects to the database, seems to have a problem when a failover occurs - I need to manually intervene and change connection strings for it to connect again.
The best solution I've found so far involves using the Failover Partner parameter of the connection string, however it's neither intuitive nor complete: Data Source="Mirror";Failover Partner="Principal", found in a blog post.
From the example in the blog above (scenario #3) when the first failover occurs, and principal (failover partner) is unavailable, data source is used instead (which is the new principal). If it fails again (and I only tried within a limited period), it then comes up with an error message. This happens because the connection string is cached, so until this is refreshed, it will keep coming out with an error (it seems connection string refreshes ~5 mins after it encounters an error). If after failover I swap data source and failover partner, I will have one more silent failover again.
Is there a way to achieve fully automatic failover for applications that use mirroring databases too (without ever seeing the error)?
I can see potential workarounds using custom scripts that would poll currently active database node name and adjust connection string accordingly, however it seems like an overkill at the moment.
Read the blog post here
http://blogs.msdn.com/b/spike/archive/2010/12/15/running-a-database-mirror-setup-with-the-sqlbrowser-service-off-may-produce-unexpected-results.aspx
It explains what is happening, the failover partner is actually being read from the sql server not from your config. Run the query in that post to find out what is actually being used as the failover server. It will probably be a machine name that is not discoverable from where your client is running.
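The query in that post is along these lines (a sketch; it lists every mirrored database on the instance):

```sql
-- Shows the partner name the server itself will hand to clients on failover
SELECT DB_NAME(database_id) AS [database],
       mirroring_partner_instance
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;
```

Compare mirroring_partner_instance with what your client can actually resolve; if it is a bare machine name, the client may not be able to reach it.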
You can clear the connection pool in case a failover has happened. Not very nice, I know ;-)
// ClearAllPools resets (or empties) the connection pool.
// If there are connections in use at the time of the call,
// they are marked appropriately and will be discarded
// (instead of being returned to the pool) when Close is called on them.
System.Data.SqlClient.SqlConnection.ClearAllPools();
We use it when we change an underlying server via SQL Server alias, to enforce a "refresh" of the server name.
http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlconnection.clearallpools.aspx
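Put together (a sketch; the retry is deliberately naive and the connection string is a placeholder), the failover handling might look like:

```csharp
using System.Data.SqlClient;

class FailoverAwareOpen
{
    static SqlConnection OpenWithFailoverRetry(string connectionString)
    {
        try
        {
            var connection = new SqlConnection(connectionString);
            connection.Open();
            return connection;
        }
        catch (SqlException)
        {
            // The pool may still hold connections to the old principal;
            // drop them all and try once more against the new principal.
            SqlConnection.ClearAllPools();
            var connection = new SqlConnection(connectionString);
            connection.Open();
            return connection;
        }
    }
}
```

A production version would distinguish transient errors from genuine ones before retrying, but this shows where ClearAllPools fits.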
The solution is to turn connection pooling off by adding Pooling=false to the connection string.
Whilst this has minimal impact on small applications, I haven't tested it with applications that receive hundreds of requests per minute (or more), and I'm not sure what the implications are. Anyone care to comment?
Try this connectionString:
connectionString="Data Source=[MSSQLPrincipalServerIP,MSSQLPORT];Failover Partner=[MSSQLMirrorServerIP,MSSQLPORT];Initial Catalog=DatabaseName;Persist Security Info=True;User Id=userName; Password=userPassword.; Connection Timeout=15;"
If you are doing .NET development, you can try ObjAdoDBLib, or PigSQLSrvLib and PigSQLSrvCoreLib, and the code will become simpler.
Example code:
New object
ObjAdoDBLib
Me.ConnSQLSrv = New ConnSQLSrv(Me.DBSrv, Me.MirrDBSrv, Me.CurrDB, Me.DBUser, Me.DBPwd, Me.ProviderSQLSrv)
PigSQLSrvLib or PigSQLSrvCoreLib
Me.ConnSQLSrv = New ConnSQLSrv(Me.DBSrv, Me.MirrDBSrv, Me.CurrDB, Me.DBUser, Me.DBPwd)
Execute this method to automatically connect to the online database after the mirror database fails over.
Me.ConnSQLSrv.OpenOrKeepActive
For more information, see the relevant links.
https://www.nuget.org/packages/ObjAdoDBLib/
https://www.nuget.org/packages/PigSQLSrvLib/
https://www.nuget.org/packages/PigSQLSrvCoreLib/
I'm testing ELMAH and have deliberately turned off the database connection for the ELMAH log in my application to see what will happen in production if the DB isn't available.
It seems that ELMAH can't trap its own errors - the AXD file isn't available when the SQL database log fails.
What is the intended behavior of ELMAH if the database isn't available?
How can I diagnose my errors if this occurs?
It seems that ELMAH can't trap its own errors
ELMAH does trap its own errors to some extent. If the ErrorLogModule encounters an exception while attempting to log the error then the exception resulting from logging is sent to the standard .NET Framework trace facility. See line 123 from 1.0 sources. See also the following walk-through from the ASP.NET documentation for getting the standard .NET Framework tracing working with ASP.NET tracing:
Walkthrough: Integrating ASP.NET Tracing with System.Diagnostics Tracing
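A sketch of the web.config pieces that capture that trace output to a file (the listener name and path are placeholders):

```xml
<system.diagnostics>
  <trace autoflush="true">
    <listeners>
      <!-- ELMAH writes its own logging failures to the default trace -->
      <add name="elmahFallback"
           type="System.Diagnostics.TextWriterTraceListener"
           initializeData="App_Data\elmah-fallback.log" />
    </listeners>
  </trace>
</system.diagnostics>
```

With a listener in place, the SqlException that prevented logging ends up in the file instead of vanishing.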
the AXD file isn't available when the SQL database log fails.
That is correct. SQL Server database connectivity must be functional to view errors stored in a SQL Server database when using SqlErrorLog.
What is the intended behavior of ELMAH if the database isn't available?
If, for example, the SQL Server database is down, a SqlException will occur during logging. ELMAH will then send the SqlException object content to the standard .NET Framework trace facility.
How can I diagnose my errors if this occurs?
The best option here is to also enable logging and e-mailing of errors. If the database is down, chances are good that the mail gateway is up and you will still get notified of errors; the errors will, in effect, get logged in some mailbox(es). This also has the added advantage that if the mail gateway is ever down, chances are the database will be up and errors will get logged there. If both are down, however, then you will need to seriously review your production infrastructure and possibly add further monitoring of your system's health.
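The mail fallback is configured in the elmah section of web.config; a sketch with placeholder addresses and gateway:

```xml
<elmah>
  <errorMail from="elmah@example.com"
             to="ops@example.com"
             subject="ELMAH error"
             smtpServer="smtp.example.com" />
</elmah>
```

Elmah.ErrorMailModule also has to be registered alongside ErrorLogModule in the httpModules/modules section for the mails to be sent.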
Not really sure about ELMAH, but the expected behaviour of such logging frameworks is not to throw any exceptions if something goes wrong inside them. I.e. if ELMAH's database is down, I'd assume it will simply not log the errors to the database.
As suggested above you can/should use alternative sinks - email or flat file.
You can always use the xml file option to log your errors.
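For example (the log path is a placeholder):

```xml
<elmah>
  <errorLog type="Elmah.XmlFileErrorLog, Elmah" logPath="~/App_Data/errors" />
</elmah>
```

This writes one XML file per error under the application's App_Data folder, with no database dependency.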
I think you're mixing up contexts a bit.
ELMAH's behavior if the database isn't available is to not log errors to the database. If an exception is thrown on the server, or if you raise an exception via an ErrorSignal, ELMAH will let that exception pass through to either a yellow screen or a custom errors page (your setting).
Since the Errors.axd page is only accessible to those who should be seeing it (ideally), it is okay to present that error to the user.
The bottom line is that if the errors database is down you can't diagnose errors. For us, if that were the case, we'd have bigger problems since the error database sits with the production database.
I would also advocate against using XML logging as your primary logging source. SQL Server is going to give you the best performance without having to manage files; with XML logging that is not the case.