How to debug socket resets in C - c

I have some code that establishes a connection to servers. For a while, my code runs normally and everything works fine.
At some point though, when trying to connect out, my application sends a SYN, gets a SYN/ACK, and then starts sending a FIN, which terminates the connection!
This is on FreeBSD 9. I have checked all of the limits, and as far as I can tell I am not exceeding any open-socket limits. Besides, I would not have expected the socket to open, let alone send the SYN, if something like that were going on.
What else can I do to debug this? After it happens for one outbound connection, it starts happening for all of them, so I think it must be some kind of systemic problem.

It's more likely to be a coding whoops than some complicated, sinister networking issue. I agree with netcoder, check your calls, check your return values. Check that you're not doing something daft, like eating file descriptors! Check your firewalls at both ends, I've seen that effect with a firewall getting overly protective before. Or post some code for us to look at...
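As a rough illustration of "check your return values" and "don't eat file descriptors", here is a minimal sketch of an outbound connect that logs errno on every failure path and always closes the descriptor it opened. The helper name connect_checked is hypothetical and not taken from the asker's code; it assumes IPv4 and a blocking socket.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: connect to ip:port, returning the fd or -1.
 * Every failure path logs the reason and closes the descriptor,
 * so a failed attempt cannot leak one. */
int connect_checked(const char *ip, uint16_t port)
{
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(port);
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
        fprintf(stderr, "bad IPv4 address: %s\n", ip);
        return -1;
    }

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        /* EMFILE or ENFILE here means descriptors are being eaten. */
        fprintf(stderr, "socket: %s\n", strerror(errno));
        return -1;
    }

    if (connect(fd, (struct sockaddr *)&dst, sizeof dst) < 0) {
        int saved = errno;              /* close() may overwrite errno */
        fprintf(stderr, "connect (fd %d): %s\n", fd, strerror(saved));
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

If the fd number in the log keeps climbing from one attempt to the next, something is not closing its descriptors, because the kernel always hands out the lowest free number.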

Related

How to properly restart server socket?

Once in a while my server's accept call just stops working properly.
There is a much deeper story behind this: I'm being flooded with SYN and SYN/ACK packets, my network router goes disco and accept keeps returning ECONNABORTED.... I already tried to debug and fix this specific attack, but without success. By now I have given up and would rather look for a more generic server recovery solution.
Anyway, I figured out that simply "restarting" the server socket by closing it and calling socket again helps. Theoretically very simple, but in practice I'm facing a huge challenge because (a) the server is quite complex by now and (b) I don't know when exactly I should restart the server socket.
My setup is one accept-thread that calls accept and feeds epoll, one listener-thread that listens for epoll read/write etc. events and feeds a queue of a thread pool.
I have not found any literature that guides one through restarting the server socket.
Particularly:
When do I actually restart the server socket? I mean, I do not really know whether an ECONNABORTED return value from accept is just an aborted connection or the accept/file descriptor is going banana.
How does closing the server socket affect epoll and connected clients? Should I close the server socket immediately or rather have a buffer time such that all clients have finished first?
Or is it even best to have two alternating server sockets such that if one goes banana I just try the other one.
I am making some assumptions about the things you say in your question all being true and accurate, even though some of them seem like they may be misdiagnosed. Unfortunately, you didn't really explain how you reached the conclusions presented, so I really can't do much other than assume they're true.
For example, you don't explain how or why you figured that closing and calling socket again will help. From just the information you gave, I would strongly suspect the opposite is true. But again, without knowing the evidence and rationale that led you to figure that, all I can do is assume it's true despite my instinct and experience saying it's wrong.
When do I actually restart the server socket? I mean, I do not really know whether an ECONNABORTED return value from accept is just an aborted connection or the accept/file descriptor is going banana.
If it really is the case that accepting connections will recover faster from a restart than without one and you really can't get any connections through, keep track of the last successful connection and the number of failures since the last successful connection. If, for example, you've gone 120 seconds or more without a successful connection and had at least four failed connections since the last successful one, then close and re-open. You may need to tune those parameters.
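The bookkeeping described above could be sketched roughly like this. The struct and function names are hypothetical, and the thresholds are only the example figures from this answer (120 seconds, four failures), which you would tune.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical bookkeeping for the accept loop. */
struct accept_health {
    time_t last_success;   /* when accept() last returned a usable fd */
    unsigned failures;     /* failed accepts since then */
};

/* Example thresholds only; tune them for your traffic. */
enum { RESTART_AFTER_SECS = 120, RESTART_MIN_FAILURES = 4 };

/* Call after every accept() that yields a connection. */
void note_accept_success(struct accept_health *h, time_t now)
{
    h->last_success = now;
    h->failures = 0;
}

/* Call after every accept() that fails, e.g. with ECONNABORTED. */
void note_accept_failure(struct accept_health *h)
{
    h->failures++;
}

/* True only when BOTH conditions hold: a long period without any
 * success AND enough failures since the last success. */
bool should_restart_listener(const struct accept_health *h, time_t now)
{
    return now - h->last_success >= RESTART_AFTER_SECS
        && h->failures >= RESTART_MIN_FAILURES;
}
```

Requiring both conditions means a single aborted connection on a healthy server never triggers a restart, and neither does a quiet night with no traffic at all.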
How does closing the server socket affect epoll and connected clients?
It has no effect on them unless you're using epoll on the server socket itself. In that case, make sure to remove it from the set before closing it.
Should I close the server socket immediately or rather have a buffer time such that all clients have finished first?
I would suggest "draining" the socket by calling accept without blocking until it returns EWOULDBLOCK. Then you can close it. If you get any legitimate connections in that process, don't close it since it's obviously still working.
A client that tries to get in between your close and getting around to calling listen on a new socket might get an error. But if they're getting errors anyway, that should be acceptable.
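A minimal sketch of the "drain before closing" idea, assuming a POSIX system. The function name drain_listener, the callback type and the retry cap are all hypothetical choices for illustration; the cap is there so that a continuing flood of aborted connections cannot keep the loop spinning forever.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical callback type: takes ownership of an accepted client fd. */
typedef void (*client_handler)(int client_fd);

enum { DRAIN_MAX_TRIES = 1024 };   /* arbitrary cap, an assumption */

/* Accept everything already queued on lfd without blocking.
 * Returns the number of connections accepted (0 means nothing usable
 * was queued, so closing the listener loses nothing), or -1 on an
 * unexpected error. Side effect: lfd is left in non-blocking mode. */
int drain_listener(int lfd, client_handler on_client)
{
    int flags = fcntl(lfd, F_GETFL, 0);
    if (flags < 0 || fcntl(lfd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    int accepted = 0;
    for (int tries = 0; tries < DRAIN_MAX_TRIES; tries++) {
        int cfd = accept(lfd, NULL, NULL);
        if (cfd >= 0) {
            accepted++;
            if (on_client)
                on_client(cfd);         /* hand the client to the server */
            else
                close(cfd);
            continue;
        }
        if (errno == EWOULDBLOCK || errno == EAGAIN)
            return accepted;            /* backlog is empty */
        if (errno == EINTR || errno == ECONNABORTED)
            continue;                   /* transient, keep draining */
        return -1;
    }
    return accepted;                    /* gave up after the retry cap */
}
```

If the return value is greater than zero, the listener is evidently still delivering connections, so per the advice above you would keep it rather than close it.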
Or is it even best to have two alternating server sockets such that if one goes banana I just try the other one.
A long time ago, port DoS attacks were common because built-in defenses to things like SYN-bombs weren't as good as they are now. In those days, it was common for a server to support several different ports and for clients to try the ports in rotation. This is why IRC servers often accepted connections on ranges of ports such as 6660-6669. That meant an attacker had to do ten times as much work to make all the ports unusable. These days, it's pretty rare for an attack to take out a specific inbound port so the practice has largely gone away. But if you are facing an attack that can take out specific listening ports, it might make sense to open more listening ports.
Or you could work harder to understand the attack and figure out why you are having a problem that virtually nobody else is having.

Windows socket error code 10055

I've developed an app that uses sockets on Windows. It works perfectly, but after some time the internet connection begins to fail and finally I get this error (10055), which means that my app has run out of buffer space.
Actually, I think I am only using 2 sockets in the code I wrote myself, but it's true that I'm using a 3rd-party library and I have no idea how it's implemented.
I've read that there is a lot of literature about this problem, so I am not the only one who suffers from it, but I cannot work out how to solve it, or at least work around it, because when it fails it makes my computer lose its internet connection. I've tried catching this error and, when it occurs, calling WSACleanup() and WSAStartup(), even though that's not best practice... but my app still gets stuck on this error.
Any advice will be pretty much appreciated.
Usually this happens when you don't close your sockets properly. Make sure you call both shutdown and closesocket when you want to close a socket (http://msdn.microsoft.com/en-us/library/windows/desktop/ms741394(v=vs.85).aspx). From MSDN: "Note To assure that all data is sent and received on a connection, an application should call shutdown before calling closesocket"
Before you bind the socket, you can set SO_REUSEADDR with setsockopt, which "Allows the socket to be bound to an address that is already in use" (http://msdn.microsoft.com/en-us/library/windows/desktop/ms740476(v=vs.85).aspx)
Finally, look at this blog - http://blogs.technet.com/b/yongrhee/archive/2011/12/19/how-to-troubleshoot-a-handle-leak.aspx
You have one or more resource leaks in your application.
Without the code I can only give general recommendations.
I recommend that you run Valgrind or similar tools to help you find the resource leak.
Another way is by reviewing the code.
If the leak started recently you can probably find it by reviewing just recent changes.
MSDN has an article on how to locate memory leaks using Visual Studio. (Remember to choose your version of Visual Studio on the linked page).
One cause of this error in Windows is the exhaustion of the ephemeral TCP ports pool.
It's easy to reproduce this error: just write a program that keeps binding sockets to port 0 in a loop.
Very soon this error will happen.
When we pass a 0 to the bind socket function, Windows chooses an ephemeral port to use.

How to use SO_KEEPALIVE option properly to detect that the client at the other end is down?

I was trying to learn the usage of option SO_KEEPALIVE in socket programming in C language under Linux environment.
I created a server socket and used my browser to connect to it. It was successful and I was able to read the GET request, but I got stuck on the usage of SO_KEEPALIVE.
I checked this link keepalive_description#tldg.org but I could not find any example which shows how to use it.
As soon as accept() returns a client connection, I set the SO_KEEPALIVE option to 1 on the client socket. Now I don't know how to check whether the client is down, how to change the time interval between the probes, etc.
I mean, how will I get the signal that the client is down without reading from or writing to the client? I thought I would get some signal when the probes are not answered by the client. How should I program this after setting SO_KEEPALIVE?
Also, suppose the probes are sent every 3 secs and the client goes down in between: I will not know that the client is down, and I may get SIGPIPE.
Anyways importantly I wanna know how to use SO_KEEPALIVE in the code.
To modify the number of probes or the probe intervals, you write values to the /proc filesystem like
echo 600 > /proc/sys/net/ipv4/tcp_keepalive_time
echo 60 > /proc/sys/net/ipv4/tcp_keepalive_intvl
echo 20 > /proc/sys/net/ipv4/tcp_keepalive_probes
Note that these values are global for all keepalive-enabled sockets on the system. You can also override these settings on a per-socket basis with setsockopt; see section 4.2 of the document you linked.
You can't "check" the status of the socket from userspace with keepalive. Instead, the kernel is simply more aggressive about forcing the remote end to acknowledge packets and determining whether the socket has gone bad. When you attempt to write to the socket, you will get a SIGPIPE if keepalive has determined that the remote end is down.
You'll get the same result if you enable SO_KEEPALIVE, as if you don't enable SO_KEEPALIVE - typically you'll find the socket ready and get an error when you read from it.
You can set the keepalive timeout on a per-socket basis under Linux (this may be a Linux-specific feature). I'd recommend this rather than changing the system-wide setting. See the man page for tcp for more info.
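A minimal sketch of the per-socket override, assuming Linux, where tcp(7) documents the TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT options. The function name enable_keepalive and its parameter names are hypothetical.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one socket and override the system-wide timers
 * for that socket only (Linux-specific options, see tcp(7)).
 * Returns 0 on success, -1 with errno set on failure. */
int enable_keepalive(int fd, int idle_secs, int interval_secs, int probes)
{
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
        return -1;
    /* Idle time before the first probe is sent. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle_secs, sizeof idle_secs) < 0)
        return -1;
    /* Time between successive probes. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &interval_secs, sizeof interval_secs) < 0)
        return -1;
    /* Unanswered probes before the connection is declared dead. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &probes, sizeof probes) < 0)
        return -1;
    return 0;
}
```

For example, enable_keepalive(client_fd, 60, 10, 5) would have the kernel give up on a silent peer after roughly 60 + 10 * 5 = 110 seconds, without touching the /proc settings used by every other socket on the machine.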
Finally, if your client is a web browser, it's quite likely that it will close the socket fairly quickly anyway - most of them will only hold keepalive (HTTP 1.1) connections open for a relatively short time (30s, 1 min etc). Of course if the client machine has disappeared or network down (which is what SO_KEEPALIVE is really useful for detecting), then it won't be able to actively close the socket.
As already discussed, SO_KEEPALIVE makes the kernel more aggressive about continually verifying the connection even when you're not doing anything, but does not change or enhance the way the information is delivered to you. You'll find out when you try to actually do something (for example "write"), and you'll find out right away since the kernel is now just reporting the status of a previously set flag, rather than having to wait a few seconds (or much longer in some cases) for network activity to fail. The exact same code logic you had for handling the "other side went away unexpectedly" condition will still be used; what changes is the timing (not the method).
Virtually every "practical" sockets program in some way provides non-blocking access to the sockets during the data phase (maybe with select()/poll(), or maybe with fcntl()/O_NONBLOCK/EINPROGRESS&EWOULDBLOCK, or if your kernel supports it maybe with MSG_DONTWAIT). Assuming this is already done for other reasons, it's trivial (sometimes requiring no code at all) to in addition find out right away about a connection dropping. But if the data phase does not already somehow provide non-blocking access to the sockets, you won't find out about the connection dropping until the next time you try to do something.
(A TCP socket connection without some sort of non-blocking behaviour during the data phase is notoriously fragile, as if the wrong packet encounters a network problem it's very easy for the program to then "hang" indefinitely, and there's not a whole lot you can do about it.)
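To make the "find out right away" part concrete, here is a minimal sketch of a non-blocking liveness check built on poll() and a peeking recv(). The function name peer_gone is hypothetical. It only reports what the kernel already knows, so with SO_KEEPALIVE enabled a failed probe sequence shows up here as a pending error.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>

/* Returns 1 if the connection on fd is known to be closed or broken,
 * 0 if it still looks alive. Never blocks and never consumes data. */
int peer_gone(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n = poll(&p, 1, 0);              /* timeout 0: just sample the state */
    if (n < 0)
        return errno == EINTR ? 0 : 1;
    if (n == 0)
        return 0;                        /* nothing pending, no error recorded */
    if (p.revents & (POLLERR | POLLNVAL))
        return 1;

    /* Readable or hung up: peek to tell data apart from EOF or an error. */
    char byte;
    ssize_t r = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (r > 0)
        return 0;                        /* real data is waiting */
    if (r == 0)
        return 1;                        /* orderly shutdown by the peer */
    return (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) ? 0 : 1;
}
```

In a program that already sits in select() or poll() during the data phase, the same information arrives for free: the dead socket becomes readable and the next read returns 0 or an error.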
Short answer, add
int flags = 1;
if (setsockopt(sfd, SOL_SOCKET, SO_KEEPALIVE, &flags, sizeof(flags)) < 0) {
    perror("ERROR: setsockopt(), SO_KEEPALIVE");
    exit(EXIT_FAILURE); /* exit(0) would report success to the caller */
}
on the server side, and a blocked read() will return with an error once the keepalive probes go unanswered.
A full explanation can be found here.

How to find where a process is stuck using DDD

I have a TCP server process written in C and running on CentOS 5.5. It acts as a TCP server for external clients and also does some IPC with other processes in the system using Unix domain sockets it has established. It's not a multi-threaded process; it does one task at a time. I use epoll_wait() to listen for requests on either the TCP socket or any of the IPC sockets it has established with internal processes. When epoll_wait() returns, I process the request for whoever it is and then go back into epoll_wait().
I have a TCP client that connects to this process from outside (not IPC). It connects successfully, sends a request message and gets a response back. I've put this in an infinite loop just to test its robustness.
After a while, the TCP server stops responding to requests coming from the TCP client. The client connects successfully and sends a request message, but it doesn't get any response back from the server.
So I reckon the TCP server is stuck somewhere else, trying to do something, and has not returned to epoll_wait() to process other incoming requests. I've tried to figure it out using logs, but that's not helping me understand where exactly the process is stuck.
So I wanted to use a debugger that can give me some information (a function name would be great) about what the process is doing. Setting breakpoints is overwhelming because the TCP server process has tons of files and functions....
I'm trying to use DDD on CentOS 5.5 to figure out what's going on. I attach to the process successfully. Then I click the "Step", "Stepi" or "Next" button.... but nothing happens....
By the way, when I use Eclipse for debugging and attach to this process (or any process), I always get "__kernel_vsyscall()".... Does this mean the program breaks by default at whatever it's doing? If that's the case, how do I get out of the __kernel_vsyscall() call to continue within my program? If I press F8, it comes out, but then I don't know where it was, since I lose the stack trace.... As I said earlier, since I can't figure out where it was, I don't know where to put a breakpoint....
In summary, I want to figure out where my process is stuck or what it's doing, and debug from that point on....
How do I go about this?
Thanks
Amit
1) Attaching to a C process can often cause problems in itself. Is there any way for you to start the process in the debugger?
2) The step functions of DDD can only be used after you've set a breakpoint and the program is stopped on a command. From reading your question, I'm not sure you've done that. You may not want to set many breakpoints, but is setting one or two in critical sections of code possible?
In summary, What I wanted to accomplish was to be able to find where my program is stuck, when it hangs. I figured it out - It was so simple. Create a configuration in Eclipse ...."Debug Configurations->C/C++ attach to application"...
Let the process run normally from shell (preferably with a terminal attached). When it hangs, open eclipse, click on the debug icon and run the configured process. It'll ask you to attach to a process. Look for your process name and attach to it.
Now, just look at the entire stack trace....you'll see some of your own function calls mixed with kernel function calls. That tells you where the program is stuck.

SO_LINGER and closing sockets(WINSOCK)

I'm writing a multithreaded Winsock application and I'm having some issues with closing the sockets.
First of all, is there a limit on the number of simultaneously open sockets? Let's say 32 sockets all at once.
I establish a connection on one of the sockets and pass information, and it all goes right.
The problem is that when I disconnect the socket and then reconnect to the same destination, I get a RST from the server after my SYN.
I don't have the code for the server app, so I can't debug it.
When I used SO_LINGER and it sent a RST at the end of each session, it worked.
But I don't want to end my connections this way.
When not using SO_LINGER, a FIN was sent, but it seems the connection was not really closed.
Any help?
thanks
On Unix there's a file descriptor limit per process - I'm guessing on Windows it's "handles".
You are probably bind()-ing your client socket to a fixed port. That might be the reason the server is rejecting your subsequent connection. Try normal ephemeral ports.
Firstly, I agree with Nikolai, are you binding your client socket?
If so, it sounds like the socket on the server side is still in TIME_WAIT and is discarding the new connection attempt. By binding the client socket you're forcing the server to try to reuse the exact same connection that is currently in the 2MSL wait period. It can't be reused at this point in time, so you're seeing what you're seeing. There's usually no need to bind the client port; stop doing it and your problem will likely go away.
Secondly, yes, there are limits to the number of open sockets on Windows platforms but they're resource related rather than some hard coded number.
Each open socket uses some 'non paged pool' memory, and each pending read or write request on a socket is also likely to use 'non paged pool' and to have pages of memory locked during I/O (there's a limit to the number of pages that can be locked). That said, on Vista and later there's much more 'non paged pool' available than on earlier versions of Windows, and even then I've managed to achieve more than 70,000 concurrent active connections on a pretty low spec XP box (see here: http://www.lenholgate.com/blog/2005/11/windows-tcpip-server-performance.html). Note that there is a separate limit on the number of outbound connections that you can establish (which is more likely to be of interest to you), but that's around 4000 by default and can be tuned by setting the MaxUserPort registry value; see here: Maximum number of concurrent TCP/IP connections - Win XP SP3 for more details.
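On the earlier point about not binding the client socket, here is a minimal sketch of connecting without bind() so that the stack picks a fresh ephemeral source port each time. It is written against the POSIX sockets API rather than Winsock (an assumption made for illustration; on Windows you would additionally need WSAStartup and closesocket), and the helper names connect_ephemeral and local_port are hypothetical.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to dst WITHOUT calling bind() first, so the stack assigns an
 * ephemeral source port to the connection. Returns the fd or -1. */
int connect_ephemeral(const struct sockaddr_in *dst)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (const struct sockaddr *)dst, sizeof *dst) < 0) {
        close(fd);                      /* do not leak the descriptor */
        return -1;
    }
    return fd;
}

/* Report the local port the stack chose (host byte order), 0 on error. */
unsigned local_port(int fd)
{
    struct sockaddr_in me;
    socklen_t len = sizeof me;
    memset(&me, 0, sizeof me);
    if (getsockname(fd, (struct sockaddr *)&me, &len) < 0)
        return 0;
    return ntohs(me.sin_port);
}
```

Logging local_port() for each session is a quick way to confirm that the source port really changes between connections; if it is always the same number, something in the client is still binding it.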

Resources