I've got a strange issue with a server accepting TCP connections. Even though there are normally some processes waiting, at some volume of connections it hangs.
Long version:
The server is written in Perl and binds a $srv socket with the reuse flag and listen == 5. Afterwards, it forks into 10 processes with a loop of $clt=$srv->accept(); do_processing($clt); $clt->shutdown(2);
The client written in C is also very simple - it sends some lines, then receives all lines available and does a shutdown(sockfd, 2); There's nothing async going on and at the end both send and receive queues are empty (as reported by netstat).
Connections last only ~20ms. All clients behave the same way, are the same implementation, etc. Now let's say I'm accepting X connections from client 1 and another X from client 2. Processes still report that they're idle all the time. If I add another X connections from client 3, suddenly the server processes start hanging just after accepting. The first blocking thing they do after accept(); is while (<$clt>) ... - but they don't get any data (on the first try already). Suddenly all 10 processes are in this state and do not stop waiting. On strace, the server processes seem to hang on read(), which makes sense.
There are loads of connections in TIME_WAIT state belonging to that server (~100 when the problem starts to manifest), but this might be a red herring.
What could be happening here?
After some more analysis: it turned out that the client was at fault, not closing previous connections properly before trying the next one. The servers at the beginning of the load-balancing list were left with stale connections.
This probably isn't the solution to your problem, but it might solve a problem you'll experience in the future: don't forget to close() the sockets when you're done! shutdown() will disconnect the stream, but it'll still eat a file descriptor.
Since you said strace is showing processes stuck in read(), then your problem seems to be that the client isn't sending the data you expect it to be sending. You should either fix your client, or add an alarm() to your server processes so that they can survive dead clients.
Does it surge and then pause a long time (circa two minutes or so) and then surge again? If so you may not have your system max open files limit set high enough.
I need to clarify something. I'm making a server/client TCP program in C.
What happens if a client tries to connect (using connect()) when the server is not stuck in accept()? I mean, when it's busy? What does connect() return?
EDIT:
I'm on Linux environment.
if (connect(...) < 0) {
// ERROR AND LEAVE
}
This is what I'm doing in my client. From what I've read and learned, if the server is busy and not accepting, connect() should wait a little bit, and then return -1, if the server is still busy. Is that right?
If so, how do I avoid that "little bit"? I want it to return -1 right away.
From what I've read and learned, if the server is busy and not accepting, connect() should wait a little bit, and then return -1, if the server is still busy. Is that right?
The acceptance of the TCP connection, i.e. the TCP handshake, is done entirely in the OS kernel, independent of any call to accept. accept just returns already established connections to user space. Thus, even if the server is currently busy, the connection will succeed as long as there is still space in the pending queue. The size of the pending queue is set with listen. If the pending queue is full because the application has not retrieved accepted connections from it for some time while clients kept connecting, then the server OS will drop or reject the connect attempt, i.e. connect will fail. On Linux the default is to drop silently, so the client keeps retrying and connect only fails after a timeout; an immediate reset requires the tcp_abort_on_overflow setting.
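A small sketch that shows this behaviour on the loopback interface: the "server" only calls listen and never accept, yet the client's connect still succeeds. The helper names listen_loopback and connect_loopback are hypothetical, made up for this illustration:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: listen on 127.0.0.1 with an OS-chosen port.
 * Returns the listening fd and stores the port (host order) in *port. */
int listen_loopback(int backlog, unsigned short *port)
{
    assert(port != NULL);
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in a;
    socklen_t len = sizeof a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = 0;                          /* let the OS pick a free port */

    if (bind(fd, (struct sockaddr *)&a, sizeof a) < 0 ||
        listen(fd, backlog) < 0 ||
        getsockname(fd, (struct sockaddr *)&a, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(a.sin_port);
    return fd;
}

/* Hypothetical helper: blocking connect to 127.0.0.1:port.
 * Returns the connected fd, or -1 on failure. */
int connect_loopback(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&a, sizeof a) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling listen_loopback(5, &port) and then connect_loopback(port), with no accept in between, returns a valid descriptor: the kernel completed the handshake and parked the connection in the pending queue.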
Once in a while my server accept functions just stop working properly anymore.
There is a much deeper story behind this: I'm being flooded with SYN and SYN/ACK packets, my network router goes disco and accept keeps returning ECONNABORTED... I already tried to debug and fix this specific attack, but without success. By now I have given up and would rather look for a more generic server recovery solution.
Anyway, I figured out that simply "restarting" the server socket by closing it and calling socket again helps. Theoretically this is very simple, but in practice I'm facing a huge challenge because (a) the server is quite complex by now and (b) I don't know when exactly I should restart the server socket.
My setup is one accept-thread that calls accept and feeds epoll, one listener-thread that listens for epoll read/write etc. events and feeds a queue of a thread pool.
I have not found any literature that guides one through restarting the server socket.
Particularly:
When do I actually restart the server socket? I mean I do not really know if an ECONNABORTED return value from accept is just an aborted connection or the accept/filedescriptor is going banana.
How does closing the server socket affect epoll and connected clients? Should I close the server socket immediately or rather have a buffer time such that all clients have finished first?
Or is it even best to have two alternating server sockets such that if one goes banana I just try the other one.
I am making some assumptions about the things you say in your question all being true and accurate even though some of them seem like they may be misdiagnosed. Unfortunately, you didn't really explain how you reached the conclusions presented, so I really can't do much other than assume they're true.
For example, you don't explain how or why you figured that closing and calling socket again will help. From just the information you gave, I would strongly suspect the opposite is true. But again, without knowing the evidence and rationale that led you to figure that, all I can do is assume it's true despite my instinct and experience saying it's wrong.
When do I actually restart the server socket? I mean I do not really know if an ECONNABORTED return value from accept is just an aborted connection or the accept/filedescriptor is going banana.
If it really is the case that accepting connections will recover faster from a restart than without one and you really can't get any connections through, keep track of the last successful connection and the number of failures since the last successful connection. If, for example, you've gone 120 seconds or more without a successful connection and had at least four failed connections since the last successful one, then close and re-open. You may need to tune those parameters.
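The bookkeeping described above could look roughly like this. The struct and function names are hypothetical; the thresholds are passed in as parameters so they can be tuned, with 120 seconds and four failures being the example values from the text:

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical bookkeeping for the accept thread. */
struct accept_health {
    time_t   last_success;   /* when accept() last returned a connection */
    unsigned failures;       /* failed accept() calls since then */
};

/* Call after every successful accept(). */
void note_accept_success(struct accept_health *h, time_t now)
{
    assert(h != NULL);
    h->last_success = now;
    h->failures = 0;
}

/* Call after every failed accept(), e.g. one returning ECONNABORTED. */
void note_accept_failure(struct accept_health *h)
{
    assert(h != NULL);
    h->failures++;
}

/* True when the socket has been quiet for at least max_quiet_sec AND
 * at least min_failures failures were seen since the last success. */
bool should_restart(const struct accept_health *h, time_t now,
                    double max_quiet_sec, unsigned min_failures)
{
    assert(h != NULL);
    return difftime(now, h->last_success) >= max_quiet_sec &&
           h->failures >= min_failures;
}
```

The accept thread would call should_restart(&h, time(NULL), 120.0, 4) after each failure and only then go through the close and re-open sequence.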
How does closing the server socket affect epoll and connected clients?
It has no effect on them unless you're using epoll on the server socket itself. In that case, make sure to remove it from the set before closing it.
Should I close the server socket immediately or rather have a buffer time such that all clients have finished first?
I would suggest "draining" the socket by calling accept without blocking until it returns EWOULDBLOCK. Then you can close it. If you get any legitimate connections in that process, don't close it since it's obviously still working.
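A sketch of that draining step, assuming a POSIX system. drain_listener and the handle callback are hypothetical names; the callback stands in for whatever normally hands a new client to epoll:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Accept everything already queued on listen_fd without blocking.
 * handle (hypothetical callback) takes ownership of each client fd;
 * if it is NULL the client is simply closed.
 * Returns the number of connections accepted, or -1 if the socket
 * could not be switched to non-blocking mode. */
int drain_listener(int listen_fd, void (*handle)(int client_fd))
{
    assert(listen_fd >= 0);

    int flags = fcntl(listen_fd, F_GETFL, 0);
    if (flags < 0 || fcntl(listen_fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    int accepted = 0;
    for (;;) {
        int c = accept(listen_fd, NULL, NULL);
        if (c >= 0) {
            accepted++;
            if (handle)
                handle(c);
            else
                close(c);
            continue;
        }
        if (errno == EINTR || errno == ECONNABORTED)
            continue;   /* transient: try the next queued connection */
        break;          /* EAGAIN/EWOULDBLOCK (queue empty) or a hard error */
    }
    return accepted;
}
```

If the return value is greater than zero, the socket is evidently still delivering connections and should stay open; if it is zero, the caller can remove the socket from epoll and close it.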
A client that tries to get in between your close and getting around to calling listen on a new socket might get an error. But if they're getting errors anyway, that should be acceptable.
Or is it even best to have two alternating server sockets such that if one goes banana I just try the other one.
A long time ago, port DoS attacks were common because built-in defenses to things like SYN-bombs weren't as good as they are now. In those days, it was common for a server to support several different ports and for clients to try the ports in rotation. This is why IRC servers often accepted connections on ranges of ports such as 6660-6669. That meant an attacker had to do ten times as much work to make all the ports unusable. These days, it's pretty rare for an attack to take out a specific inbound port so the practice has largely gone away. But if you are facing an attack that can take out specific listening ports, it might make sense to open more listening ports.
Or you could work harder to understand the attack and figure out why you are having a problem that virtually nobody else is having.
I have a program that needs to:
Handle 20 connections. My program will act as client in every connection, each client connecting to a different server.
Once connected my client should send a request to the server every second and wait for a response. If no request is sent within 9 seconds, the server will time out the client.
It is unacceptable for one connection to cause problems for the rest of the connections.
I do not have access to threads and I do not have access to non-blocking sockets. I have a single-threaded program with blocking sockets.
Edit: The reason I cannot use threads and non blocking sockets is that I am on a non-standard system. I have a single RTOS(Real-Time Operating System) task available.
To solve this, use of select is necessary but I am not sure if it is sufficient.
Initially I connect to all servers. But select can only be used to see if a read or write will block or not, not if a connect will.
So when I have connected to, say, 2 servers and they are all waiting to be served, what if the 3rd does not work? The connect will block, causing the first 2 connections to time out as well.
Can this be solved?
I think the connection-issue can be solved by setting a timeout for the connect-operation, so that it will fail fast enough. Of course that will limit you if the network really is working, but you have a very long (slow) path to some of the server(s). That's bad design, but your requirements are pretty harsh.
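One way to bound a blocking connect without switching to non-blocking mode is a send timeout on the socket. This is only a sketch under a loud assumption: it relies on the TCP stack applying SO_SNDTIMEO to connect(), which Linux documents but which may not hold on the RTOS in question. The helper name connect_with_timeout is hypothetical:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Hypothetical helper: blocking connect to ip:port that gives up after
 * timeout_sec seconds. Returns the connected fd, or -1 on failure.
 * ASSUMPTION: the stack honours SO_SNDTIMEO for connect(), as Linux does. */
int connect_with_timeout(const char *ip, unsigned short port, long timeout_sec)
{
    assert(ip != NULL && timeout_sec > 0);

    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &a.sin_addr) != 1)
        return -1;                      /* not a dotted-quad IPv4 address */

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };
    if (setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof tv) < 0 ||
        connect(fd, (struct sockaddr *)&a, sizeof a) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note that the timeout stays on the socket and also applies to later send() calls. With a 1 second budget per attempt, a dead third server costs at most 1 second of the 9 second window the other connections have.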
See this answer for details on connection-timeouts.
It seems you need to isolate the connections. Well, if you cannot use threads you can always resort to good-old-processes.
Spawn each client by forking your server process and use traditional IPC mechanisms if communication between them is required.
If you can't use a multiprocess approach either, I'm afraid you'll have a hard time doing that.
I realise that I'll get at least one answer along the lines of "(re)write the code so it doesn't hang" but let's assume we don't live in that shiny happy utopia just yet...
In our embedded system we have a big SDK including a web-server (Boa) which is the primary method of user interaction.
It's possible, during certain phases of the moon, that something can cause the web server to hang or become otherwise stuck in such a way that the process appears running normally (not crashed/dead/using 100% CPU) but does not serve any web pages.
So, the question is, how do we test/detect this situation?
To test whether the server is hung, create a TCP socket and connect to port 80 on IP address 127.0.0.1 (the loopback address). Then send the following text over the socket:
GET / HTTP/1.1\r\n\r\n
Most servers will interpret that as a request for index.html. Alternatively, you could implement an undocumented URL for testing (which allows for a shorter, predetermined response), e.g.
GET /test/fdoaoqfaf12491r2h1rfda HTTP/1.1\r\n\r\n
You then need to read the response from the server. This involves using select with a reasonable timeout to determine whether any data came back from the server, and if so, use recv to read the data. The response from the server will consist of a header followed by content. The header consists of lines of text, with a blank line at the end of the header. Lines end with \r\n, so the end of the header is \r\n\r\n.
Getting the content involves calling select and recv until recv returns 0. This assumes that the server will send the response and then close the socket. Some sophisticated servers will leave a socket open to allow multiple requests over the same socket. A simple embedded server should not be doing that. (If your server is trying to use the same socket for multiple requests, then you need to figure out how to turn that feature off.)
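A minimal sketch of such a probe, assuming a POSIX environment on the device. The function name probe_http is hypothetical, and the port is a parameter (the text uses 80). It only checks that some response bytes arrive within the timeout and does not parse the header or read the full content:

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if the server on 127.0.0.1:port sent at least one byte
 * within timeout_sec, 0 if it took the connection but stayed silent
 * (looks hung) or closed without sending anything, and -1 if the
 * connect, send or select call itself failed. */
int probe_http(unsigned short port, long timeout_sec)
{
    static const char req[] = "GET / HTTP/1.1\r\n\r\n";
    assert(timeout_sec > 0);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in a;
    memset(&a, 0, sizeof a);
    a.sin_family = AF_INET;
    a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    a.sin_port = htons(port);

    int result = -1;
    /* MSG_NOSIGNAL (Linux/POSIX.1-2008) avoids SIGPIPE if the peer is gone. */
    if (connect(fd, (struct sockaddr *)&a, sizeof a) == 0 &&
        send(fd, req, sizeof req - 1, MSG_NOSIGNAL) == (ssize_t)(sizeof req - 1)) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

        int ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready > 0) {
            char buf[512];
            result = recv(fd, buf, sizeof buf, 0) > 0 ? 1 : 0;
        } else if (ready == 0) {
            result = 0;                 /* timed out: no response at all */
        }
    }
    close(fd);
    return result;
}
```

A watchdog task could call probe_http(80, 5) periodically and restart the web server after a few consecutive results other than 1. The connect here is still a plain blocking call, so a server whose pending queue is full can stall the probe itself.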
That's all very well and good, but you really need to rewrite your code so it doesn't hang.
The most likely cause of the problem is that the server has a bunch of dangling sockets, i.e. connections from clients that were never properly cleaned up. Dangling sockets will eventually prevent the server from accepting more connections, either because the server has a limit on the number of open connections, or because the process that's running the server uses up all of its file descriptors.
The first thing to check is the TCP timeout value. One project that I worked on had a default timeout of 5 hours, which meant that dangling sockets stayed open for 5 hours. A reasonable timeout is 1 minute.
Then you need to create a client that deliberately misbehaves. Clients can misbehave by
leaving a socket open without reading the server's response
abruptly closing the socket while reading the response
gracefully closing the socket while reading the response
The first situation should be handled by the TCP timeout. The other two need to be properly handled by the server code. Graceful and abrupt socket closure is controlled via the SO_LINGER socket option (set with setsockopt) and the shutdown function. After the client misbehaves, check the number of open file descriptors in the server process, to verify that the server has handled the situation correctly.
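For the misbehaving test client, the two kinds of closure could be sketched like this on a POSIX system. The function names close_abruptly and close_gracefully are hypothetical:

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Abrupt close: SO_LINGER with a zero timeout makes close() discard
 * any unsent data and reset the connection (RST) instead of sending FIN. */
int close_abruptly(int fd)
{
    assert(fd >= 0);
    struct linger lg = { .l_onoff = 1, .l_linger = 0 };
    if (setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof lg) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}

/* Graceful close: send FIN with shutdown(), discard whatever the peer
 * still sends until it closes its side, then release the descriptor.
 * Note: the loop blocks until the peer closes. */
int close_gracefully(int fd)
{
    assert(fd >= 0);
    char buf[256];
    shutdown(fd, SHUT_WR);
    while (recv(fd, buf, sizeof buf, 0) > 0) {
        /* discard remaining data */
    }
    return close(fd);
}
```

On the server side, the abrupt case typically shows up as recv failing with ECONNRESET, while the graceful case shows up as recv returning 0. Either way the server must still call close on its own descriptor.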
I'm writing a multithreaded Winsock application and I'm having some issues with closing the sockets.
First of all, is there a limit on the number of simultaneously open sockets? Let's say 32 sockets all at once.
I establish a connection on one of the sockets and pass information, and it all goes right.
The problem is that when I disconnect the socket and then reconnect to the same destination, I get a RST from the server after my SYN.
I don't have the code for the server app, so I can't debug it.
When I used SO_LINGER and it sent a RST flag at the end of each session, it worked.
But I don't want to end my connections this way.
When not using SO_LINGER, a FIN flag was sent, but it seems the connection was not really closed.
Any help?
Thanks
On Unix there's a file descriptor limit per process - I'm guessing on Windows it's "handles".
You are probably bind()-ing your client socket to a fixed port. That might be the reason the server is rejecting your subsequent connection. Try normal ephemeral ports.
Firstly, I agree with Nikolai, are you binding your client socket?
If so, it sounds like the socket on the server side is still in TIME_WAIT and is discarding the new connection attempt. By binding the client socket you're forcing the server to try to reuse the exact same connection that is currently in the 2MSL wait period. It can't be reused at this point in time, and so you're seeing what you're seeing. There's usually no need to bind the client port; stop doing it and your problem will likely go away.
Secondly, yes, there are limits to the number of open sockets on Windows platforms but they're resource related rather than some hard coded number.
Each open socket uses some 'non paged pool' memory and each pending read or write request on a socket is also likely to use both 'non paged pool' and have pages of memory locked in memory during I/O (there's a limit to the number of pages that can be locked). That said, on Vista and later there's much more 'non paged pool' available than on earlier versions of Windows, and even then I've managed to achieve more than 70,000 concurrent active connections on a pretty low spec XP box (see here: http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html). Note that there are some separate limits on the number of outbound connections that you can establish (which is more likely to be of interest to you), but that's around 4000 by default and can be tuned by setting the MaxUserPort registry value; see here: Maximum number of concurrent TCP/IP connections - Win XP SP3 for more details.