We have a website running on a server. We have a "production" instance and a "staging" instance each having its own database. The MSSQL Server is running locally on the same server.
Today, suddenly the "production" website went down. Looking at the logs, the following exception showed up:
System.Data.Entity.Core.EntityException: The underlying provider failed on Open. ---> System.InvalidOperationException: Timeout expired. The timeout period elapsed prior to obtaining a connection from the pool. This may have occurred because all pooled connections were in use and max pool size was reached.
At the same time, the "staging" website was working normally.
While trying to figure out what was happening, I tried all sorts of things like re-creating both the app pool and the IIS app. I also hooked up the "production" IIS app to the same app pool of the "staging" app, still the same issue. Restarted the server too of course.
Also, I ran the executable of the "production" website directly (as a console app) and it worked normally. So it's a problem that happens only when running under IIS.
One last thing I tried was reconfiguring the "staging" website to use the "production" database, and to my utter shock it worked normally, even though I had assumed the problem was the "production" database itself.
I just have no idea whatsoever about what's going on here. Any help is very much appreciated.
If all the connections in the connection pool are used, it is almost certainly because your application is opening database connections and failing to close them.
Since you are using Entity Framework, it's probably because your application is failing to dispose of the DbContext object.
It's nothing to do with the production database as such; probably the increased activity on your production site vs your staging site is making the application bug manifest itself more quickly.
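To make the failure mode concrete, here is a minimal, language-neutral sketch in Python. TinyPool, leaky_requests and careful_requests are hypothetical names invented for this illustration; they are not Entity Framework or ADO.NET APIs. The `with` block plays the role that a C# `using` block around the DbContext would play.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Hypothetical fixed-size pool, standing in for a real connection pool."""

    def __init__(self, max_size, timeout=0.1):
        self._free = queue.Queue()
        for i in range(max_size):
            self._free.put(f"conn-{i}")
        self._timeout = timeout

    def acquire(self):
        # Wait up to `timeout` seconds for a free connection, then give up.
        try:
            return self._free.get(timeout=self._timeout)
        except queue.Empty:
            raise TimeoutError("max pool size was reached") from None

    def release(self, conn):
        self._free.put(conn)

    @contextmanager
    def connection(self):
        # Guarantees release even if the request handler raises.
        conn = self.acquire()
        try:
            yield conn
        finally:
            self.release(conn)


def leaky_requests(pool, count):
    """Handle requests without releasing; returns how many were served."""
    served = 0
    for _ in range(count):
        try:
            pool.acquire()  # bug: the connection is never released
        except TimeoutError:
            break
        served += 1
    return served


def careful_requests(pool, count):
    """Handle requests, returning each connection to the pool when done."""
    served = 0
    for _ in range(count):
        with pool.connection():
            served += 1
    return served
```

With a pool of 3, the leaky version serves 3 requests and then every later request times out, no matter how healthy the database is. The careful version can serve any number of requests with the same pool.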
I am upgrading our Airflow instance from 1.9 to 1.10.3 and whenever the scheduler runs now I get a warning that the database connection has been invalidated and it's trying to reconnect. A bunch of these errors show up in a row. The console also indicates that tasks are being scheduled but if I check the database nothing is ever being written.
The following warning shows up where it didn't before:
[2019-05-21 17:29:26,017] {sqlalchemy.py:81} WARNING - DB connection invalidated. Reconnecting...
Eventually, I also get this error:
FATAL: remaining connection slots are reserved for non-replication superuser connections
I've tried increasing the SQLAlchemy pool size setting in airflow.cfg, but that had no effect:
# The SqlAlchemy pool size is the maximum number of database connections in the pool.
sql_alchemy_pool_size = 10
I'm using CeleryExecutor and I'm thinking that maybe the number of workers is overloading the database connections.
I run three commands, airflow webserver, airflow scheduler, and airflow worker, so there should only be one worker and I don't see why that would overload the database.
How do I resolve the database connection errors? Is there a setting to increase the number of database connections, if so where is it? Do I need to handle the workers differently?
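One way to reason about why raising sql_alchemy_pool_size might not help: the pool size is a client-side cap per process, whereas the FATAL message comes from the PostgreSQL server running out of connection slots. Below is a rough back-of-the-envelope sketch. It assumes each process owns its own SQLAlchemy QueuePool (which can open up to pool_size + max_overflow connections, with max_overflow defaulting to 10) and PostgreSQL defaults of max_connections = 100 with 3 superuser-reserved slots. The process counts are made up for illustration, not measured from Airflow.

```python
def peak_connections(processes, pool_size, max_overflow=10):
    """Upper bound on open connections if every process fills its own pool."""
    return processes * (pool_size + max_overflow)


def fits(processes, pool_size, server_limit=100, reserved=3, max_overflow=10):
    """True if the worst case still leaves the server with free slots."""
    usable = server_limit - reserved  # slots available to ordinary users
    return peak_connections(processes, pool_size, max_overflow) <= usable


# Hypothetical numbers: 8 processes, each with the pool size from airflow.cfg.
worst_case = peak_connections(processes=8, pool_size=10)
```

Under these assumptions, a larger pool size raises the worst case rather than lowering it, so the number of processes holding pools matters as much as the pool size itself.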
Update:
Even with no workers running, starting the webserver and scheduler fresh, when the scheduler fills up the airflow pools the DB connection warning starts to appear.
Update 2:
I found the following issue in the Airflow Jira: https://issues.apache.org/jira/browse/AIRFLOW-4567
There is some activity with others saying they see the same issue. It is unclear whether this directly causes the crashes that some people are seeing or whether this is just an annoying cosmetic log. As of yet there is no resolution to this problem.
This has been resolved in the latest version of Airflow, 1.10.4
I believe it was fixed by AIRFLOW-4332, updating SQLAlchemy to a newer version.
Pull request
I posted a new blog post on my site and promoted it on Facebook, and the traffic spike was far bigger than anticipated. The server went down from the volume of traffic, and ever since it was rebooted I have been getting a database connection error.
I contacted my server host and they told me this:
"I was able to get the relevant database details from the wp-config.php file in the home directory for your site and, using those creds I am able to connect to the relevant database without a problem.
To be sure that I was able to connect AND make a query to the database I have also created a simple test script that can be viewed at http://yoursite.com/mysqltest.php
This confirms that the server is responding correctly and that the database itself is able to accept connections and queries.
This leaves us with the likelihood that the issue lies with the scripting/configuration of the wordpress installation which is not something I am going to be able to assist you with.
I suspect that the problem lies with the wp-config.php file but cannot be certain."
I can't see how the wp-config would have changed; I haven't touched it in over a month and it's been working fine otherwise. The website was also working fine after I posted that blog post, and it only stopped working after the server was rebooted. All the other sites on the server remain in perfect working condition. I don't see how a traffic spike could have done this. I'm lost as to what to do next. Please help! :(
D
Try this database connection test script https://gist.github.com/162913
I have an MVC3 application hosted by a third-party hosting provider. The site has been running well for the past 3 months without any problems. Today the application suddenly started throwing the following exception, part of which is shown below as recorded in my logs.
System.Data.ProviderIncompatibleException: The provider did not return
a ProviderManifestToken string. --->
System.Data.SqlClient.SqlException: Timeout expired. The timeout
period elapsed prior to completion of the operation or the server is
not responding.
The message is self-explanatory, and I first thought I should increase the connect timeout, but the exception was still thrown, which suggests the other part of the message (server not responding). I contacted my hosting provider and he said there was nothing wrong on his part. So I am stuck with a down website and don't know what to do.
Any ideas why the provider is throwing the exception listed above? Also, is it possible for me to connect remotely to the database on the hosting server with limited authority? Are there any tools for that? I have no database experience beyond application programming.
This is caused by a timeout; the default timeout is 30 seconds. There are two common reasons for a timeout:
Long-running tasks or uncommitted transactions. Refer to "Timeout expired" to learn more about this.
I've recently switched to running my development environment over our company's VPN using NetExtender. Since then, my database-driven applications time out the first time they try to hit the database. After the timeout (30 seconds or so) and an additional 5-10 seconds, all DB calls succeed. During those 5-10 seconds, the timeout error response is returned immediately. It seems to be related to SQL Server needing to create a new database session for me: each time I need to be assigned a new client process ID, I time out. This is a huge problem when using ReSharper + NUnit as a test harness, because each test run creates a new instance of ReSharper's unit test runner, which causes another timeout. The server timeout seems to be around 30 seconds, which is certainly generous enough for a connection to be established.
It sounds to me like it could be a DNS issue. If the primary DNS is not properly configured and is inaccessible from the VPN client, the lookup will time out and pass on to the secondary.
Additionally, some VPNs allow you to access some local resources - this could put the DNS on your own, local network in play.
I think I'd try changing the DNS-order and see if that did the trick.
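Before changing the DNS order, it may be worth measuring whether name resolution really is the slow step. A minimal sketch using only the Python standard library follows; time_lookup is a name made up for this example, and port 1433 is assumed only because it is SQL Server's default.

```python
import socket
import time


def time_lookup(host, port=1433):
    """Return (seconds, addresses) for one lookup; addresses is [] on failure."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Each entry ends with a sockaddr tuple whose first item is the address.
        addresses = sorted({info[4][0] for info in infos})
    except socket.gaierror:
        addresses = []
    return time.perf_counter() - start, addresses
```

Run it against the database server's hostname right after connecting to the VPN, then against the server's IP address. If the hostname lookup takes many seconds while the IP literal returns at once, the DNS order is a likely suspect.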
Context: The Cloud
We have a java-based web application that we normally host on our own servers. Recently we used Amazon Web Services (AWS EC2) cloud to host an instance.
This "cloud setup" matches our typical "on site" setup: one server for the app server, another server for the database server. (Several app servers point to the same database server)
The problem
In this cloud setup, we receive intermittent "connection reset by peer" errors between the database and the jdbc driver, where at (seemingly) random intervals and at random points in the codebase, the database connection fails.
Here are a few error excerpts from the log:
Stack Trace Example 1:
at com.participate.pe.genericdisplay.client.taglib.GenDisplayViewTag.doStartTag(GenDisplayViewTag.java:77)
... 75 more
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:304)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.getMetaData(SQLServerConnection.java:1734)
at org.jboss.resource.adapter.jdbc.WrappedConnection.getMetaData(WrappedConnection.java:354)
Stack Trace Example 2
at java.lang.Thread.run(Thread.java:619)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1368)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1355)
at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1532)
at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:3274)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:4437)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:4389)
at com.microsoft.sqlserver.jdbc.SQLServerConnection$1ConnectionCommand.doExecute(SQLServerConnection.java:1457)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4026)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1416)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectionCommand(SQLServerConnection.java:1462)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.setAutoCommit(SQLServerConnection.java:1610)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.checkTransaction(BaseWrapperManagedConnection.java:429)
Technical Environment
Jboss 4.2.2.GA (Jboss-Web 2.0/ Tomcat 6)
MSSQL 2005 2.0 jdbc driver
Some points
We have never seen this problem in our own environment (i.e. own data centers) running the application for several years
This led me to conclude "something funny is going on with Amazon network environment". I may be wrong/missing something/etc.
This problem only occurs with our application. We have other java and php applications which have not had this problem. The other java application uses a different jdbc driver (jtds, afaik)
It doesn't seem like a simple connection timeout
Questions
-Has anyone seen this before?
-If it's an EC2 "known issue", can we configure our way around the problem (i.e. make sure everything is on its own subnet or virtual private cloud (VPC))?
-Any jdbc driver settings to get past this problem?
** Update **
I've extended and increased the bounty on this question.
One extra bit of information: the two virtual servers (database and application server) were on different subnets, i.e. one hop between the two servers.
In a non-cloud environment we have "zero hops" between the two servers.
Our hosting admins said we had no control over the subnets of our EC2 instances. This made me wonder if virtual private cloud would help.
thanks in advance
will
Not sure if this is related or not. We experienced something similar with an app that we were running in the EC2 environment. Same symptom: the database connection would intermittently close. We were using the MSSQL 1.2 driver. Also, we would usually see the errors after a delay or idle time on the connection. Our assumption (never proven) was that something in the network layer was closing the connection and the client wasn't detecting it, so it became stale.
We were able to work around it because we were using commons connection pools, and had the pool recreate the connection on failure. We eventually moved the application out of EC2 and didn't see the issue again.
Just a word of caution on using DBCP/connection pool features to mitigate the issue - the more you enable 'testOnBorrow' and other features, the more you can introduce latency or other performance-changing effects on the system. I don't know if DBCP still does this or not, but a few years ago it would generate actual test queries to test the connection - full stack, database responses - not just at the network layer. The above link from Brian brings back horrific memories from the early 2000s surrounding retry logic for JDBC connection management.
Anyway, it's tough to really root-cause this, other than by gathering evidence and narrowing the 'seemingly random' down to a specific set of conditions:
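To show where that cost comes from, here is a small sketch of the validate-on-borrow idea in Python, with sqlite3 standing in for the real database. ValidatingPool and its methods are hypothetical names for this illustration; this is not DBCP's implementation, only the general pattern.

```python
import sqlite3


class ValidatingPool:
    """Hypothetical pool that tests each idle connection before handing it out."""

    def __init__(self, connect, validation_query="SELECT 1"):
        self._connect = connect
        self._validation_query = validation_query
        self._idle = []
        self.validations = 0  # counts the extra round trips validation costs

    def _is_alive(self, conn):
        # A full query through the driver, not just a network-level check.
        self.validations += 1
        try:
            conn.execute(self._validation_query).fetchone()
            return True
        except sqlite3.Error:
            return False

    def borrow(self):
        while self._idle:
            conn = self._idle.pop()
            if self._is_alive(conn):
                return conn
            # Stale connection: drop it and try the next idle one.
        return self._connect()

    def give_back(self, conn):
        self._idle.append(conn)
```

Every borrow of an idle connection pays for one validation query, which is the latency trade-off mentioned above. In exchange, a connection that went stale while idle is replaced instead of being handed to the application.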
You could try to throw up a Wireshark/PCAP trace, find when it happens, and send the results to both Amazon and Microsoft to see if they can root cause it
You could try the above with certain test harnesses to isolate the problem (JMeter tests to get concurrency up), bounce the network connection, watch for recovery, etc
You could try alternative versions of SQL Server to discount a SQL Server/JDBC driver bug that has since been fixed.
If DNS names are used in the connection strings, you could use IP addresses instead to rule out name-resolution issues
I'm not a SQL Server expert, but another route for research could be within the related products domain - e.g. see if anyone experienced similar issues with TFS/Sharepoint (such as http://nickhoggard.wordpress.com/2009/12/07/further-experiences-with-tfs-2010-beta-2-on-amazon-ec2/)
I have seen this issue in both the EC2 environment and the Windows Azure environment. I think connection retry logic needs to be a standard part of your design when working in a distributed computing environment.
This article is for SQL Azure - but I think it equally applies to EC2 and all drivers.
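As a rough idea of what such retry logic can look like, here is a minimal Python sketch with exponential backoff. with_retry and its parameters are names chosen for this example, not taken from the article or from any driver, and it assumes the wrapped operation is safe to repeat.

```python
import time


def with_retry(operation, attempts=4, base_delay=0.5,
               retry_on=(ConnectionError,), sleep=time.sleep):
    """Call operation(); on a connection error, wait and try again.

    The delay doubles after each failure. The last failure is re-raised
    so the caller still sees the real error.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Only wrap work that is safe to repeat, such as opening a connection or running a read-only query. Blindly retrying a half-completed write can apply it twice.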
I can also confirm that this happens and will spin up a lower priority investigation since it's not production critical.
Our production servers are in our data center. We use developer laptops to run our applications. Neither of these has had this issue since we configured the c3p0 connection pool timeouts and test period (see article: http://www.codefin.net/2007/05/hibernate-and-mysql-connection-timeouts.html).
However, we do have a development staging server that is in EC2, and it does indeed happen there. If I find something that seems to work, I'll ping back. Also, I'm using MySQL. I see that you are using MS SQL Server, so the problem spans database vendors.