Context: The Cloud
We have a Java-based web application that we normally host on our own servers. Recently we used the Amazon Web Services (AWS EC2) cloud to host an instance.
This "cloud setup" matches our typical "on site" setup: one server for the app server, another for the database server. (Several app servers point to the same database server.)
The Problem
In this cloud setup, we receive intermittent "connection reset by peer" errors between the JDBC driver and the database: at seemingly random intervals, and at random points in the codebase, the database connection fails.
Here are a few error excerpts from the logs.
Stack Trace Example 1:
at com.participate.pe.genericdisplay.client.taglib.GenDisplayViewTag.doStartTag(GenDisplayViewTag.java:77)
... 75 more
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: The connection is closed.
at com.microsoft.sqlserver.jdbc.SQLServerException.makeFromDriverError(SQLServerException.java:170)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.checkClosed(SQLServerConnection.java:304)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.getMetaData(SQLServerConnection.java:1734)
at org.jboss.resource.adapter.jdbc.WrappedConnection.getMetaData(WrappedConnection.java:354)
Stack Trace Example 2:
at java.lang.Thread.run(Thread.java:619)
Caused by: com.microsoft.sqlserver.jdbc.SQLServerException: Connection reset
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1368)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:1355)
at com.microsoft.sqlserver.jdbc.TDSChannel.read(IOBuffer.java:1532)
at com.microsoft.sqlserver.jdbc.TDSReader.readPacket(IOBuffer.java:3274)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:4437)
at com.microsoft.sqlserver.jdbc.TDSCommand.startResponse(IOBuffer.java:4389)
at com.microsoft.sqlserver.jdbc.SQLServerConnection$1ConnectionCommand.doExecute(SQLServerConnection.java:1457)
at com.microsoft.sqlserver.jdbc.TDSCommand.execute(IOBuffer.java:4026)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.executeCommand(SQLServerConnection.java:1416)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.connectionCommand(SQLServerConnection.java:1462)
at com.microsoft.sqlserver.jdbc.SQLServerConnection.setAutoCommit(SQLServerConnection.java:1610)
at org.jboss.resource.adapter.jdbc.BaseWrapperManagedConnection.checkTransaction(BaseWrapperManagedConnection.java:429)
Technical Environment
JBoss 4.2.2.GA (JBoss Web 2.0 / Tomcat 6)
Microsoft SQL Server 2005, with the 2.0 JDBC driver
Some Points
We have never seen this problem in our own environment (i.e. our own data centers), where we have run the application for several years.
This led me to conclude "something funny is going on with the Amazon network environment". I may be wrong/missing something/etc.
This problem only occurs with our application. We have other Java and PHP applications which have not had this problem. The other Java application uses a different JDBC driver (jTDS, AFAIK).
It doesn't seem like a simple connection timeout
Questions
-Has anyone seen this before?
-If it's an EC2 "known issue", can we configure our way around the problem (e.g. make sure everything is on its own subnet, or in a virtual private cloud (VPC))?
-Any JDBC driver settings to get past this problem?
Update:
I've extended and increased the bounty on this question.
One extra bit of information: the two virtual servers (database and application server) were on different subnets, i.e. one hop between the two servers.
In a non-cloud environment we have zero hops between the two servers.
Our hosting admins said we had no control over the subnets of our EC2 instances. This made me wonder if a virtual private cloud would help.
thanks in advance
will
Not sure if this is related or not. We experienced something similar with an app that we were running in the EC2 environment: the same symptom, that the database connection would intermittently close. We were using the MSSQL 1.2 driver. Also, we would usually see the errors after a delay or idle time on the connection. Our assumption (never proven) was that something in the network layer was closing the connection and the client wasn't detecting it, so it became stale.
We were able to work around it because we were using Commons connection pools and had the pool recreate the connection on failure. We eventually moved the application out of EC2 and didn't see the issue again.
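For illustration, here is a minimal sketch of that kind of self-healing pool using Commons DBCP - the URL, credentials, and validation query below are placeholders, not our actual configuration:

import org.apache.commons.dbcp.BasicDataSource;

public class PoolFactory {
    // Pool that validates connections before handing them out, so a
    // connection silently killed by the network is replaced instead of
    // surfacing as "The connection is closed".
    public static BasicDataSource newDataSource() {
        BasicDataSource ds = new BasicDataSource();
        ds.setDriverClassName("com.microsoft.sqlserver.jdbc.SQLServerDriver");
        ds.setUrl("jdbc:sqlserver://dbhost:1433;databaseName=mydb"); // placeholder
        ds.setUsername("appuser"); // placeholder
        ds.setPassword("secret");  // placeholder
        ds.setValidationQuery("SELECT 1");  // cheap round-trip test
        ds.setTestOnBorrow(true);           // validate on every checkout
        ds.setTestWhileIdle(true);          // also test idle connections
        ds.setTimeBetweenEvictionRunsMillis(60 * 1000); // evictor runs each minute
        return ds;
    }
}

Note the caveat in the next answer: testOnBorrow adds a round trip to every checkout.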
Just a word of caution on using DBCP/connection pool features to mitigate the issue - the more you enable 'testOnBorrow' and other features, the more latency or other performance-changing effects you can introduce into the system. I don't know if DBCP still does this or not, but a few years ago it would generate actual test queries to test the connection - full stack, database responses - not just at the network layer. The above link from Brian brings back horrific memories from the early 2000s surrounding retry logic for JDBC connection management.
Anyway, it's tough to really root cause this, other than to gather evidence and narrow the 'seemingly random' down to a specific set of conditions:
You could throw up a Wireshark/PCAP trace, find when it happens, and send the results to both Amazon and Microsoft to see if they can root cause it.
You could try the above with certain test harnesses to isolate the problem (JMeter tests to get concurrency up), bounce the network connection, watch for recovery, etc.
You could try alternative versions of SQL Server, to rule out a SQL Server/JDBC driver bug that has since been fixed.
If DNS is used in the connection strings, you could switch to IP addresses to rule out name-resolution issues (a quick JVM-side check is sketched below).
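For that last point, a simple way to see what the JVM itself resolves (the hostname here is hypothetical) is:

import java.net.InetAddress;

public class ResolveCheck {
    public static void main(String[] args) throws Exception {
        // Print every address the JVM resolves for the DB host and
        // compare against nslookup output from the same box.
        for (InetAddress a : InetAddress.getAllByName("db.example.internal")) {
            System.out.println(a.getHostAddress());
        }
    }
}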
I'm not a SQL Server expert, but another route for research could be within the related products domain - e.g. see if anyone has experienced similar issues with TFS/SharePoint (such as http://nickhoggard.wordpress.com/2009/12/07/further-experiences-with-tfs-2010-beta-2-on-amazon-ec2/).
I have seen this issue in both the EC2 environment and the Windows Azure environment. I think connection retry logic needs to be a standard part of your design when working in a distributed computing environment.
This article is for SQL Azure - but I think it equally applies to EC2 and all drivers.
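To make "retry logic" concrete, here is a rough sketch of the pattern - the attempt count and back-off are arbitrary, and a real implementation should only retry errors it knows are transient:

import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class Retry {
    interface SqlWork<T> { T run(Connection c) throws SQLException; }

    // Run the work, getting a fresh connection from the pool on each
    // attempt, so a dead connection is discarded rather than reused.
    static <T> T withRetry(DataSource ds, SqlWork<T> work)
            throws SQLException, InterruptedException {
        SQLException last = null;
        for (int attempt = 1; attempt <= 3; attempt++) {
            try (Connection c = ds.getConnection()) {
                return work.run(c);
            } catch (SQLException e) {
                last = e;                      // e.g. "Connection reset"
                Thread.sleep(1000L * attempt); // crude linear back-off
            }
        }
        throw last;
    }
}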
I can also confirm that this happens, and I will spin up a lower-priority investigation, since it's not production critical.
Our production servers are in our data center. We use developer laptops to run our applications. Neither of these has had this issue since we configured c3p0 connection pool timeouts and the test period (see this article: http://www.codefin.net/2007/05/hibernate-and-mysql-connection-timeouts.html).
However, we do have a development staging server in EC2, and it does indeed happen there. If I find something that seems to work, I'll ping back. Also, I'm using MySQL, and I see that you are using MS SQL Server, so it occurs across database vendors.
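Roughly, c3p0 settings of the kind that article describes look like this in code - the driver, URL, and exact values are illustrative, not a recommendation:

import com.mchange.v2.c3p0.ComboPooledDataSource;

public class C3p0Factory {
    public static ComboPooledDataSource newPool() throws Exception {
        ComboPooledDataSource cpds = new ComboPooledDataSource();
        cpds.setDriverClass("com.mysql.jdbc.Driver");  // illustrative
        cpds.setJdbcUrl("jdbc:mysql://dbhost/mydb");   // illustrative
        cpds.setIdleConnectionTestPeriod(300);  // test idle connections every 5 min
        cpds.setTestConnectionOnCheckout(true); // validate before handing out
        cpds.setPreferredTestQuery("SELECT 1"); // cheap validation query
        cpds.setMaxIdleTime(600);               // drop connections idle > 10 min
        return cpds;
    }
}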
Related
I am using Apache 2 and PHP 5.6.12. I decided to host my database remotely at Heroku (using PostgreSQL 9.4) and keep my server at Digital Ocean.
In my Yii 1 framework app, the connection configuration that I have added is the following:
'db'=>array(
'connectionString' =>
'pgsql:host=ec2-XX-XX-XX-XX.compute-1.amazonaws.com;port=6372;dbname=dddqXXXXX;sslmode=require',
'emulatePrepare' => true,
'username' => 'XXXX4dcXXXX',
'password' => 'XXXXXXXXXc34XXXXXXX123',
'charset' => 'utf8',
),
The connection is successful, but remote access makes even a simple query slow on my server at Digital Ocean. I read from Heroku that for remote access, SSL mode has to be enabled. So I did, and I am still unable to figure out why the database connection is slow. It can take up to 5 seconds. I tried with a locally installed PostgreSQL database server and everything runs as expected. I am not sure how I can solve this; otherwise I will have to move away from Heroku and do it the traditional way, which is going to be very depressing. I hope that someone can help me.
Here is my phpinfo output for pgsql:
Are there some settings that need to be done to speed up remote Heroku database access, in Apache 2 or PHP?
I was unable to ping the Postgres Heroku server as advised by Richard (Heroku has prevented pinging). It was very obvious that the connection between the Digital Ocean server and the Heroku Postgres server was slow, so I emailed Heroku directly to ask for their advice.
Heroku's Solution:
They said that applications connecting from a long distance outside the Heroku platform will have initial connection latency, and this latency is a big problem.
The application has to establish a TCP connection, which the Postgres protocol then upgrades to an SSL connection. This takes quite a few packets and introduces a lot of latency, particularly if the app is creating a new connection for each query or page load.
Heroku recommended me to configure the app to use something like heroku-pgbouncer connection pool. That uses pgbouncer and stunnel to provide a configurable connection pool for the app endpoints.
The recommendation sounded too expensive and too challenging for me to deal with.
My Solution: Use Database Labs
I found another Postgres-as-a-service provider called Database Labs. They allow users to select the data center region for better performance. Database Labs has an easy backend management platform and a friendly support team. The backend has minimal functionality, which I understand, as they started in 2014.
However, after migrating to their service, the performance of my web page improved remarkably. The connection works like any standard connection, without the need for SSL. I am posting my solution for the benefit of others who may face a similar problem.
Heroku is definitely a good provider if we host our application on Heroku and use their database service. However, if you are a Digital Ocean user, I recommend that you use Database Labs. This saves a lot of time.
There isn't really a question here exactly, so this answer is more a guide to how to test the situation.
If you don't know enough to run a packet trace, you probably want to make sure your servers are all on the same network. However, try logging in to your Digital Ocean server and just ping the Heroku one. Repeat for www.google.com and compare the times. That's assuming the Heroku server responds to pings.
You should be able to connect with "psql -h ...". Then you can run "SELECT count(*) FROM <table>", then "SELECT * FROM <table> LIMIT 10000", then the same with "LIMIT 20000". That will let you figure out how much time is spent transferring data versus running the query (psql's \timing command makes the comparison easy).
It might just be that the connection between your servers is very slow. Can't say without testing.
I have hosted my web app on server 1 and my database on server 2, but I'm getting the following error:
Communication with the underlying transaction manager has failed.
I googled and found a post which mentioned that it is a DTC (Distributed Transaction Coordinator) issue.
I enabled DTC on server 2 (the DB server) and added an exception for it in the firewall.
But I still get the same error.
Here is the full stack trace
Message: System.Transactions.TransactionManagerCommunicationException: Communication with the underlying transaction manager has failed. ---> System.Runtime.InteropServices.COMException: The MSDTC transaction manager was unable to pull the transaction from the source transaction manager due to communication problems. Possible causes are: a firewall is present and it doesn't have an exception for the MSDTC process, the two machines cannot find each other by their NetBIOS names, or the support for network transactions is not enabled for one of the two transaction managers. (Exception from HRESULT: 0x8004D02B)
at System.Transactions.Oletx.IDtcProxyShimFactory.ReceiveTransaction(UInt32 propgationTokenSize, Byte[] propgationToken, IntPtr managedIdentifier, Guid& transactionIdentifier, OletxTransactionIsolationLevel& isolationLevel, ITransactionShim& transactionShim)
at System.Transactions.TransactionInterop.GetOletxTransactionFromTransmitterPropigationToken(Byte[] propagationToken)
Kindly advise.
We had the exact same situation, and more than once. Each time, it was one of the following:
The IP address in the DNS for the server is outdated (as the error message says: "two machines cannot find each other by their NetBIOS names"). You can check if this is the case by trying ping servername from one server to the other in the command prompt. If the ping by name fails and ping by IP succeeds (or ping by name returns the wrong IP), then you should talk to the system admins to take a look at DNS/DHCP.
The servers were created as an image of a preconfigured server (for example, if you are working with virtual machines and, instead of doing a fresh install for each of the servers, you simply clone the image). This is a problem because DTC has an internal "identifier" - and in the case of image cloning, both your installations now have the same DTC ID and won't be able to communicate with each other. The solution is to simply uninstall and install DTC again (see below).
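The reinstall itself is quick - from an elevated command prompt on the affected server, something like the following, then start the Distributed Transaction Coordinator service again:

msdtc -uninstall
msdtc -install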
Hope it helps.
Things to check:
Have you done this configuration on both servers?
Are both servers members of the same domain?
Have you checked the event log?
I had the same problem while connecting to a remote SQL Server.
The solution in my case was to add "enlist=false" to the connection string.
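For example (server and database names are placeholders):

Data Source=myServer;Initial Catalog=myDatabase;Integrated Security=SSPI;Enlist=false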
I was missing quite a lot of things:
No authentication (as the DB server and app server are not within the same AD domain)
A Windows Firewall rule enabling msdtc.exe
A rule in the firewall between the DMZ and the internal zone allowing TCP 135, 1024-65535 in both directions. The link tells you how to restrict the firewall policy to a few ports only.
Short and long server names in the hosts files, or a shared DNS server. E.g. 192.168.1.1 app1 as well as 192.168.1.1 app1.domain.local
On the other hand, based on this link, my setup doesn't require:
Allow Remote Clients
Allow Remote Administration
Enable XA Transactions (required prior to Windows Server 2003 SP1)
Solved after adding the remote IP/machine name to the hosts and lmhosts files on the server, in the folder C:\Windows\System32\drivers\etc.
One of our servers displayed this error after the Virtual Machine (VM) controlling our Domain Controller froze. Several related communication problems also started to pop up (like failed password resets). Resetting the frozen VM fixed the issue.
Lots of helpful answers already given.
One problem for me was the presence of invalid (Cyrillic) characters in the computer name.
And there is also a way to validate the connection between two servers (or between a server and a computer) using a small tool from Microsoft called DTCPing.
I've recently moved a classic ASP site from a single-server IIS 6 (Windows Server 2003) and SQL Server 2005 setup to a Hyper-V setup running Windows Server 2012 on the host and two VMs (a single physical machine).
Here is a diagram of the current setup:
My problem is that I am getting the following error intermittently:
Named Pipes Provider: Could not open a connection to SQL Server [53].
I've been told and was able to prove that the web-to-DB traffic never uses the physical NIC, so that should rule out any issues w/ the NIC or its drivers/configuration.
I've also made sure that there are no IP conflicts (the host and VM IPs are all different).
The only pattern I can detect is that it seems more likely to happen during peak periods. The odd thing is it can go 7 days without an error, and then on a single day, the error will happen on 50-100 requests, often within the same 30 seconds, or in groups of 30-second intervals.
I've been trying to figure this out for weeks -- since migrating to the new server over 3 weeks ago. If no one here can help, my last resort is to open a ticket with Microsoft. However, I'm not optimistic they will be able to help as I'm not able to reproduce it.
As a last resort, I'm considering moving them back to a single instance, which I'm trying my best to avoid.
Update:
Here is the connection string I'm using:
Provider=SQLNCLI11;Server=[my DB VM IP address];Integrated Security=SSPI;
Suppose the following:
I have a database set up on database.mywebsite.com, which resolves to IP 111.111.1.1, running from a local DNS server on our network.
I have countless ASP, ASP.NET and WinForms applications that use a connection string utilising database.mywebsite.com as the server name, all running from the internal network.
Then the box running the database dies, and I switch over to a new box with an IP of 222.222.2.2.
So, I update the DNS for database.mywebsite.com to point to 222.222.2.2.
Will all the applications and computers running them have cached the old resolved IP address?
I'm assuming they will have.
Any suggestions along the lines of "don't have your IP change each time you switch box" are not too welcome as I cannot control this aspect of the situation, unfortunately. We are currently using the machine name of the box, which changes every time it dies and all apps etc. have to be updated with the new machine name. It hurts.
Even if the DNS is not cached local to the machine, it will likely be cached somewhere along the DNS chain between the machine and the name servers, at least for a short while. My understanding is this situation would usually be handled with IP takeover where you just make the new machine 111.111.1.1.
Probably a question for Server Fault.
You're looking for the DNS TTL (Time To Live), I guess. In my opinion, applications should cache the IP for at most the value of the TTL. I'm afraid, however, that some applications/technologies might actually cache it longer (again, in my opinion, completely wrong).
Each machine will cache the IP address.
The length of time it is cached is the TTL (Time To Live). This is a setting on your DNS server; if you set it very low, say 5 minutes, then you should be up and running fairly quickly. A bit of a hack, but it should work.
Yes, the other comments are correct in that what controls this is the DNS TTL set for the hostname database.mywebsite.com.
You'll have to decide on the maximum amount of time you're willing to wait if you have a failure on your primary address (111.111.1.1) after you make the switch to the secondary address. Lower settings will give you a quicker recovery time, but will also increase the load and bandwidth on your DNS server, because clients will have to re-query it to refresh their cache more often.
You can use nslookup using the -d option from your cmd prompt to see what your default TTL times and remaining TTL times are for the DNS server you are querying.
%> nslookup -d google.com
You should assume that they are cached, for two reasons not clearly mentioned before:
1. Many "modern" versions of OS families do DNS caching.
2. Many applications do DNS caching, or have poor error/failure detection on live connections and/or when opening new connections. This would possibly include your database client (example below).
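As a concrete example of point 2: the Java runtime caches successful name lookups itself (indefinitely when a security manager is installed), and you can cap that cache with a security property. A minimal sketch:

import java.security.Security;

public class DnsCacheConfig {
    public static void main(String[] args) {
        // Cap the JVM's own DNS cache at 60 seconds so a changed
        // A record is picked up without restarting the process.
        // Must be set before the first lookup of the name.
        Security.setProperty("networkaddress.cache.ttl", "60");
    }
}

The same can be done at launch with -Dsun.net.inetaddr.ttl=60 on Sun/Oracle JVMs.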
Also, this is probably not well documented. I did some googling, and found this for MySQL:
http://dev.mysql.com/doc/refman/5.0/en/connector-net-programming-connecting-connection-string.html#connector-net-programming-connecting-errors
It does not clearly explain its behavior in this regard.
I had a similar issue with a web site that disables the application pool recycling features and runs for weeks on end. Sometimes a clustered SQL Server box would restart and, for some reason, my SqlConnections were not reconnecting. I was getting the error:
A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: Named Pipes Provider, error: 40 - Could not open a connection to SQL Server)
The server was there - and running - in fact, if I just recycled the app pool, the app would work fine - but I don't like recycling app pools!
The connections that were being held in the connection pool were somehow using old connection information, and that could have been old IP addresses. This is what seems so similar to the poster's question, that it appears to be cached DNS information, because as soon as some sort of a cache is cleared, the app works fine.
This is how I solved it - by forcing all of the connections in the pool to be re-created:
Try
    ' Example: SqlDependency, but this could also be any SqlConnection.Open call
    Dim result As Boolean = SqlClient.SqlDependency.Start(ConnStr)
Catch sqlex As SqlClient.SqlException
    ' Discard every pooled connection so the next request opens a fresh one
    SqlClient.SqlConnection.ClearAllPools()
End Try
The code sample is just the boiled-down basics - it should be tweaked for your situation!
The DNS gets cached, but for any server that resolves to the wrong IP address, you can update the HOSTS file of the server, and the IP should be updated immediately. This could be a solution if you have a limited number of servers accessing your database server.
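Using the addresses from the question, the HOSTS entry (C:\Windows\System32\drivers\etc\hosts on Windows, /etc/hosts on Linux) would be:

222.222.2.2    database.mywebsite.com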
Here is the full error: SqlException: A transport-level error has occurred when receiving results from the server. (provider: Shared Memory Provider, error: 1 - I/O Error detected in read/write operation)
I've started seeing this message intermittently for a few of the unit tests in my application (there are over 1100 unit & system tests). I'm using the test runner in ReSharper 4.1.
One other thing: my development machine is a VMware virtual machine.
I ran into this many moons ago. The bottom line is you are running out of available ports.
First, make sure your calling application has connection pooling on.
If it does, then check the number of available ports for the SQL Server.
What is happening is that if pooling is off, every call takes a port, and by default a closed port sits in TIME_WAIT for 4 minutes before it expires, so under load you run out of ports.
If pooling is on, then you need to profile all the ports of the SQL Server, make sure you have enough, and expand them if necessary.
When I came across this error, connection pooling was off, and it caused this issue whenever a decent load was put on the website. We did not see it in development because the load was 2 or 3 people at most, but once the number grew over 10 we kept seeing this error. We turned pooling on, and that fixed it.
I ran into this many moons ago as well. Not to discount @Longhorn213's explanation, but we had the exact opposite behavior. We received the error in development and testing, but not in production, where obviously the load was much greater. We ended up tolerating the issue in development, as it was sporadic and didn't materially slow down progress. I think there could be several reasons for this error, but I was never able to pinpoint the cause myself.
We've also run across this error and figured out that we were killing the SQL Server connection from the database server side. The client application was under the impression that the connection was still active and tried to make use of that connection, but failed because it had been terminated.
We saw this in our environment, and traced part of it down to the "NOLOCK" hint in our queries. We removed the NOLOCK hint and set our servers to use Snapshot Isolation mode, and the frequency of these errors was reduced quite a bit.
We have seen this error a few times and tried different resolutions with varying success. One common underlying theme has been that the system giving the error was running low on memory. This is especially true if the server hosting SQL Server is running ANY other non-OS process. By default, SQL Server will grab any memory it can, leaving little for other processes/drivers. This can cause erratic behavior and intermittent messages. It is good practice to configure your SQL Server with a maximum memory that leaves some headroom if there are other processes that might need it. Example: Visual Studio on a dev machine that is running a copy of SQL Server Developer Edition on the same machine.
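Setting the cap is a one-time T-SQL change; the 4096 MB figure below is purely illustrative - size it to your machine:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 4096;
RECONFIGURE;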