SYN receives RST,ACK very frequently - sql-server

Hi Socket Programming experts,
I am writing a proxy server on Linux for SQL server 2005/2008 running on Windows.
The proxy is coded using bsd sockets and in C, and it is working fine with a problem described below.
When I use a database client (written in JAVA, and running on a Linux box) to fire queries (with a concurrency of 100 or more) directly to the Database server, not experiencing connection resets. But through my proxy I am experiencing many connection resets.
Digging deeper I came to know that connection from 'DB client' to 'Proxy' always succeeds
but when the 'Proxy' tries to connect to the DB server the connection fails, due to the SYN packet getting RST,ACK.
That was to give some background. The question is :
Why does sometimes SYN receives RST,ACK?
DB client(linux) to Server(windows) ----> Works fine
DB client(linux) to Proxy(Linux) to Server(windows) -----> problematic
I am aware that this can happen in "connection refused" case but this definitely is not that one. SYN flooding might be another scenario, but that does not explain fine behavior while firing to Server directly.
I am suspecting some socket option setting may be required, that the client does before connecting and my proxy does not. Please put some light on this. Any help (links or pointers) is most appreciated.
Additional info:
Wrote a C client that does concurrent connections, which takes concurrency as an argument. Here are my observations:
-> At 5000 concurrency and above, some connects failed with 'connection refused'.
-> Below 2000, it works fine.
But the actual problem is observed even at a concurrency of 100 or more.
Note: The problem is time dependent sometimes it never comes at all and sometimes it is very frequent and DB client (directly to server) works fine at all times .

SQL Server needs worker threads to accept incoming connections. If your server is worker starved (which can be easily diagnosed by a high number of entries in sys.dm_os_tasks in PENDING state) then attempting to open new connection will fail. So likely what I suspect it happens is that you're pushing to the server more workload that it can handle. You need to optimize the workload or get a beefier server.
Clients like Java client make effective use of connection pooling and, even under a high load, do not need to open new connection hence you do not see this problem, instead you see only delays in requests completion.

A listening socket keeps queue of established connections and connections in establishment process (e.g. SYN got, SYNACK replied, but not ACK from client yet). If a established queue overflows, IP stack reaction differs on OS. The most traditional approach was to ignore newcoming SYNs, waiting when userland accept()s and frees a slot in the queue. With SYN flooding attacks in mid-90s, the new method was invented named "SYN cookies" which drops need for establishment queue totally, in cost of need to support special TCP option.
OTOH I have heard that Windows stacks change their behavior - under some conditions, the reaction to queue overflow is RST response. In earlier stacks (e.g. Win95) this was the main response and client side was correspondingly changed to ignore RST response to SYN:(
That's why I guess that some proxy host feature triggers RST in Windows stack.
Another guess is that DB server closes listening socket at all under some condition (e.g. detected overload peak) which appears only with proxy.

When the SYN receives the RST response, it should not be the problem of SQL-SERVER.
Because the application is able to accept the socket only after the tcp handshake finished.
Is there any device between the Proxy and the SQL-Server machines?
Try to make sure that the RST response come from sql-server machine.
You connection count is far from SYN-FLOOD, I think.

Related

What is the cause of a sporadic ORA-12599: TNS:cryptographic checksum mismatch?

In our production environment there are sometimes peaks of ORA-12599 errors the connection works otherwise.
The description is not quite clear to me.
Cause: The data received is not the same as the data sent.
Action:
Attempt the transaction again. If error persists, check (and correct)
the integrity of your physical connection.
https://www.oraexcel.com/database-oracle-11gR2-ORA-12599
Does that mean there is a problem with my network cable? Or any other Hardware?
Because our DBA said the error can be caused by a configuration issue on the client side.
Edit:
It started without any changes on the client or server side. And there are spikes. 80 errors in an hour and then no errors for 4 hours while the load is more or less constant
It can mean a mismatch in how the client and the server are wishing to communicate (in encrypted fashion). For example, trying to connect to an 11g database with 10g JDBC drivers will get this because the encryption implementation changed between these versions.
Check your local (ie, client) sqlnet.ora file - you may need to opt for a different method or change the order of them.
The error can be typically be ignored (its informational) but using the latest drivers will normally resolve the issue.
We had sporadic ORA-12599 errors and also ORA-24300 ("bad value for mode") in a Python web app using Flask and uwsgi. It turned out to be caused by multiple worker processes writing and reading to/from the same TCP connection to the DB, giving chaotic communications on the socket.
In our setup this was caused by creating the TCP connections to the DB before uwsgi forked the workers, and the file descriptors being inherited. We fixed this by using uwgsi's lazy-apps mode.
See also this question.

SQL Server Service Broker not sending messages, unable to start endpoint

I had a perfectly working SQL Server Service Broker this morning, until I tested how it recovers from crashing.
I forced a system shutdown on the sender during a messaging session between servers over a network. I was sending binary messages of about 5mb size. There are automatic procedures for sending, replying and receiving messages and ending conversations from both sides in place and my setup uses certificates for security.
I am now unable to send any messages from the server side.
Both sides of the messaging chain have queues on and it does not seem like poison message handling would be causing this. The sender side accepts new messages but is not sending them.
The sender side transmission queue has messages with transmission_status
The Service Broker endpoint cannot listen for connections due to the following error: '10013(An attempt was made to access a socket in a way forbidden by its access permissions.)'.
Running ALTER ENDPOINT myendpoint STATE = STARTED returns the same error as above.
Running select * from sys.endpoints shows the endpoint with state_desc = STARTED anyhow..
Running select state_desc from [sender_database].sys.conversation_endpoints shows state_desc = CONVERSING for all results.
Running SELECT COUNT(*) FROM dbo.sender_queue returns 0.
There is no other traffic to the port my endpoint is using, at least not any that is visible with netstat or the TCPView tool. The ports have rules to allow traffic from the firewall and sqlagent and sqlsrvr processes also have extra rules to be allowed.
Using ssbdiagnose tool with ssbdiagnose -level info configuration from service... from the sender side shows a (not new) error
The route for service sender_service is classified as REMOTE. This will result in the message being forwarded.
along with some other errors about certificates that have always been there even when messaging was working. Ssbdiagnose with RUNTIME flag shows nothing at all.
Ssbdiagnose from the target side now says an exception occurs during connection. The target database also has a couple of reply messages stuck in the transmission queue with an empty transmission_status.
Edit: Seems that occasionally the status on the target side changes to the error 10060 connection failed...
What more can I do to diagnose the problem and fix it?
Edit: I tried changing the port the endpoint uses but the same error is thrown.
Edit: I am able to ping the servers from each other. Ssbdiagnose with RUNTIME option on target side says it cannot find the connection to the SQL Server that corresponds to the routing address of my sender endpoint/database.
The Service Broker endpoint cannot listen for connections due to the following error: '10013(An attempt was made to access a socket in a way forbidden by its access permissions.)'
WSAEACCESS (10013) is a rather unusual socket listen error. I never encountered it before. A quick search reveals KB3039044: Error 10013 (WSAEACCES) is returned when a second bind to a excluded port fails in Windows which is an acknowledged bug in Windows Server 2008R2, 2012 and 2012R2 when excluding a range of ports (netsh ... add excludedportrange ...). So my first question is, are you on one of the affected server OSes and are you actually using a network port exclusion range?
I strongly urge you to open a Microsoft support case for this issue and follow up with them, making sure networking guys are involved (again, WSAEACCESS is rather unusual symptom). This is not one of the usual issues and it is difficult to diagnose over forums discussion.

persistent tcp socket connection in c

I have a C service application that use tcp socket for connection to a server. The server sends data now and then. Also my application sends a hearbeat every 15 seconds. But sometimes it disconnects while server seems to think the connection is live. Now if I try to reconnect the server refuses as it holds only one connection for the client at a time.
What is the best way to hold a persistent tcp connection?
Edit:
The server usually disconnects after 2 min without heartbeat. So after I find my connection is closed it takes 2min for me to successfully reconnect. I want to minimize this time.
The simplest fix is probably for the server to allow a new connection to replace an old connection rather than rejecting it. That would still keep only one connection to each client at a time.

Will the tcp connections between two applications on the same Linux machine ever disconnect

I am using a connection pool to connect to a database on the same Linux machine. I want to be absolutely sure that the connection I get from the connection pool is valid. I am now testing the connection on borrowing but theoretically the tcp connection can still disconnect between the validation and the actual request. Additionally, testing before each request hurts latency and throughput.
How about using a file socket? Will it ever disconnect?
Update: I am wondering if the connection could be broken at the networking layer. Is it true that only the app or the DB can actively end the connection?
I am now testing the connection on borrowing but theoretically the tcp connection can still disconnect between the validation and the actual request.
True but unlikely.
Additionally, testing before each request hurts latency and throughput.
(1) Only if you are borrowing a new connection for each request. (2) Have you measured the 'hurt'?
How about using a file socket? Will it ever disconnect?
You certainly can't assume that it won't. The database end may close it on idle, for example.

General network error after a night of inactivity

For some time now our flagship application has been having mysterious errors. The error message is the generic
[DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
This is reliably reproduced by leaving the app open for the night and resuming work in the morning. Since it's a backend server app this is a normal scenario.
The funny thing is - we've migrated from SQL Server 7 to 2000 to 2008 and the issue is present on all of them. But what seems to matter is the OS on which we run the app. On WinXP it works fine, on Vista/7 it fails. So the problem is at the client end.
The results of Google on the error message cover a very wide spectrum of different causes (since this is a very generic error) and none of the scenarios found there are similar to ours.
So perhaps someone around here will know what the problem is in our case?
You should be able to reproduce this error condition on demand by:
1. Opening a database connection (in your client application)
2. Unplugging the network cable
3. Plugging network cable back in (wait until the network connection is restored)
4. Using the previously opened connection to query the database
As far as I can tell from experience, client side ADO code is not able to consistently determine if an underlying network connection is actually valid or not. Checking if the database connection is open (in the client code) returns true. However, performing any operations on that connection results in a General network error.
The connection pool appears to be able to determine when a connection goes 'bad' so it never returns a bad connection to the application. It simply opens a new connection instead.
So, if a database connection is kept alive for a long time (used or unused) by the application, the underlying TCP/IP connectivity can get broken.
The bottom line is that database connections should be closed and returned back to the connection pool when not in use.
Edit
Also, depending on the number of clients connecting to the db, not using the connection pool can cause another issue. You may hit the maximum number of sockets open on the server side. This is from memory. Once a connection is closed on the client side, the connection on the server goes into a TIME_WAIT state. By default, the server socket takes about 4 minutes to close, so it is not available to other clients during that time. The bottom line is that there is a limited number of available sockets on the server. Keeping too many connections open can create a problem.
One project I worked on easily hit this socket limit with around 120 users. A new 'feature' was added that absolutely hammered the server, and after a few hours of using the app, things would suddenly slow to a crawl for everyone. SQL server was not closing enough sockets in time for new connection requests. Although there are 65K sockets altogether, only the first 5000 are made available to the ADO (this is a default registry setting thing, so can be changed).
The number of sockets in TIME_WAIT state would slowly build up until the OS would not allocate any more. So clients had to wait until server side sockets closed and a new connection could then be created.
Have you tried disabling SNP/TCP Chimneying?
Had a similar error. For me it was indirectly caused by mismatched calls to WSACleanup and WSAStartup.
The program called WSACleanup more times than WSAStartup. This would cause a reference counter (somewhere in the sockets library) to reach zero too early.
I think effectively from that moment on all sockets owned by the process are broken.
And this would also kill the SQL client since it uses sockets to 'talk' to the SQL server as well.

Resources