Linux TCP socket crash - C

I'm writing a network application which communicates over Linux TCP sockets. Recently I've noticed the send system call crashing my application. It works fine when both peers are up (I'm testing crash recovery now), but when one peer is down, the second crashes while executing this piece of code:
fprintf(stderr, "out_tcp %d\n", out_tcp);
if (send(out_tcp, &packet, sizeof(packet), 0) == -1)
    fprintf(stderr, "send TCP error");
fprintf(stderr, "after send");
The socket is already prepared and connected, and this code was executed several times before the second peer went down. I expected send to return -1, but the program prints only "out_tcp 11" and then exits. No error message, no return value from send. I ran it under Valgrind; it says the application exits normally, with no error or warning messages.
Does anyone have any idea how to debug this? Tools to use? I'm pretty stuck, as I get no information.
Thanks in advance
Harnen

Looks like your application is being killed by SIGPIPE, which is the default result of writing to a connection that the peer has closed. Please see this thread for further information:
How to prevent SIGPIPEs (or handle them properly)
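As a minimal sketch of the usual fix (names taken from the question): either ignore the signal process-wide, or pass MSG_NOSIGNAL on each send, so a dead peer shows up as an EPIPE error instead of a fatal signal.

#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Call once at startup: with SIGPIPE ignored, send() to a peer that
 * has gone away returns -1 with errno == EPIPE instead of killing
 * the process, so the error branch actually runs. */
static void ignore_sigpipe(void)
{
    signal(SIGPIPE, SIG_IGN);
}

/* Per-call alternative: MSG_NOSIGNAL suppresses the signal for this
 * send only, no handler needed. */
static int send_packet(int out_tcp, const void *packet, size_t len)
{
    if (send(out_tcp, packet, len, MSG_NOSIGNAL) == -1) {
        fprintf(stderr, "send TCP error: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}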

SOLVED:
Use the MSG_EOR | MSG_NOSIGNAL flags in the send function, as below (MSG_NOSIGNAL is the one that suppresses the SIGPIPE):
if (send(out_tcp, &packet, sizeof(packet), MSG_EOR | MSG_NOSIGNAL) == -1)
Hope it helps

Have you tried to RTFM (read the fine manual) about error conditions? Do you catch or ignore any signals? What about the errno global variable?
man send
Also, TCP is a streaming protocol, so if you don't need any special flags it is usual to use the generic streaming calls read() and write().


Is it possible to do epoll on accept event?

Let's suppose I've created a listening socket:
sock = socket(...);
bind(sock,...);
listen(sock, ...);
Is it possible to do epoll_wait on sock to wait for an incoming connection? And how do I get the client's socket fd after that?
The thing is, on the platform I'm writing for, sockets cannot be non-blocking, but there is a working epoll implementation with timeouts, and I need to accept a connection and work with it in a single thread, so that it doesn't hang if something goes wrong and the connection doesn't come.
Without knowing what this non-standard platform is, it's impossible to know exactly what semantics they gave their epoll call. But with the standard epoll on Linux, a listening socket is reported as "readable" when an incoming connection arrives, and then you can accept the connection by calling accept. If you leave the socket in blocking mode and always check for readability using epoll's level-triggered mode before each call to accept, then this should work; the only risk is that if you somehow end up calling accept when no connection has arrived, you'll get stuck.
For example, this could happen if there are two processes sharing a listening socket and they both try to accept the same connection. Or it might happen if an incoming connection arrives and is closed again before you call accept. (Pretty sure in this case Linux still lets the accept succeed, but this kind of edge case is exactly where I'd be suspicious of a weird platform doing something weird.) You'd want to check these things.
Non-blocking mode is much more reliable because in the worst case, accept just reports that there's nothing to accept. But if that's not available, then you might be able to get away with something like this...
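A minimal sketch of that pattern under standard Linux epoll (level-triggered, listening socket left blocking; error handling trimmed, helper name mine):

#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* sock is a blocking listening socket (socket/bind/listen already done).
 * Returns the accepted client fd, or -1 on timeout/error. */
int wait_and_accept(int sock, int timeout_ms)
{
    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

    /* Level-triggered readability means a connection is waiting. */
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);
    close(epfd);
    if (n <= 0)
        return -1;  /* timed out (or error): don't risk a blocking accept */

    /* The socket was readable, so accept() should return immediately. */
    return accept(sock, NULL, NULL);
}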
Since this answer is the first result on DuckDuckGo, I will just chime in to say that under GNU/Linux 4.18.0-18-generic (Ubuntu 18.10), to asynchronously accept an incoming connection one has to watch for the errno value EWOULDBLOCK (11) and then add the socket to the epoll read set.
Here is a small shot of scheme code that achieves that:
(define (accept fd)
  (let ((out (socket:%accept fd 0 0)))
    (if (= out -1)
        (let ((code (socket:errno)))
          (if (= code EWOULDBLOCK)
              (begin
                (abort-to-prompt fd 'read)
                (accept fd))
              (error 'accept (socket:strerror code))))
        out)))
In the above, (abort-to-prompt fd 'read) will pause the coroutine and add fd to the epoll read set, done as follows:
(epoll-ctl epoll EPOLL-CTL-ADD fd (make-epoll-event-in fd))
When the coroutine is unpaused, the code proceeds after the abort to call itself recursively (in tail-call position).
My actual code is a bit more involved, since I rely on call/cc in Scheme to avoid callbacks. The full code is at SourceHut.
That is all.

Would the socket descriptor change if a client closed the connection?

We have a server with a limit on the number of incoming connections it can accept.
We have multiple clients connecting to the server at various intervals, for various different reasons.
At least one of the functions of the server requires it to process the client's request and reply back on the same socket. However:
the client complains about timing out (and, I believe, closes the socket)
the server finishes its processing successfully, but the thread gets a SIGPIPE because the socket has been closed.
I have code similar to the one below, that checks the socket descriptor.
if (connect_desc > 0)
{
    if (write(connect_desc, buffer, sizeof(buffer)) < 0)
    {
        printf("write error\n");
    }
}
else
    printf("connect_desc < 0\n");
My question is:
If the socket is closed by the client, would the socket descriptor change in value on the server? If not, is there any way to catch that in my code?
I'm not seeing that last print out.
Q: Will the descriptor change?
A: No
Q: How can I check the status of my connection?
A: One way is simply to try writing to the socket, and check the error status.
STRONG RECOMMENDATION:
Beej's Guide to Network Programming
Q. Will the file descriptor change?
Not unless:
It is documented somewhere you can cite.
The operating system magically knows about the disconnection.
The operating system magically knows where in your application the FD is stored, including all the copies.
The operating system wants to magically make it impossible for you to close the socket yourself.
None of these is true. The question doesn't even make sense.
Q. How can I check the status of my connection?
There isn't, by design, any such thing as the status of a TCP connection. The only way you can detect whether it has failed is by trying to use it.
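For instance, the "just try writing" approach from the first answer might look like this sketch (MSG_NOSIGNAL used so a dead peer comes back as an errno rather than a SIGPIPE; helper name mine):

#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

/* Returns 0 if the write went through, -1 if the connection looks dead.
 * The descriptor's value never changes; only the error status tells you. */
int try_write(int connect_desc, const void *buffer, size_t len)
{
    if (send(connect_desc, buffer, len, MSG_NOSIGNAL) < 0) {
        if (errno == EPIPE || errno == ECONNRESET)
            fprintf(stderr, "peer closed the connection: %s\n", strerror(errno));
        else
            fprintf(stderr, "write error: %s\n", strerror(errno));
        return -1;
    }
    return 0;
}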

Linux, sockets, non-blocking connect

I want to create a non-blocking connect.
Like this:
socket.connect(); // returns immediately
For this, I use another thread, an infinite loop, and Linux epoll. Like this (pseudocode):
// in another thread
{
    create_non_block_socket();
    connect();
    epoll_create();
    epoll_ctl(); // subscribe socket to all events
    while (true)
    {
        epoll_wait();   // wait a short time (~100 ms)
        check_socket(); // check for the EPOLLOUT event
    }
}
If I run the server and then the client, it all works. But if I first run the client, wait a little while, and then run the server, the client doesn't connect.
What am I doing wrong? Maybe it can be done differently?
You should use the following steps for an async connect:
create socket with socket(..., SOCK_NONBLOCK, ...)
start connection with connect(fd, ...)
if connect() neither returns 0 nor fails with errno EINPROGRESS, then abort with an error
wait until fd is signalled as ready for output
check status of socket with getsockopt(fd, SOL_SOCKET, SO_ERROR, ...)
done
No loops - unless you want to handle EINTR.
If the client is started first, you should see the error ECONNREFUSED in the last step. If this happens, close the socket and start from the beginning.
It is difficult to tell what's wrong with your code without seeing more details. I suppose that you do not abort on errors in your check_socket operation.
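Sketched in C, those steps might look like the following (epoll used for the wait, as in the question; error handling abbreviated, helper name mine):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Starts a non-blocking connect and waits for the result.
 * Returns the connected fd, or -1 on failure. */
int async_connect(const struct sockaddr *addr, socklen_t addrlen)
{
    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd == -1)
        return -1;

    if (connect(fd, addr, addrlen) == -1 && errno != EINPROGRESS) {
        close(fd);                  /* immediate failure */
        return -1;
    }

    /* Wait until the socket is signalled as ready for output. */
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLOUT, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
    epoll_wait(epfd, &ev, 1, -1);   /* -1 blocks; pass a timeout if needed */
    close(epfd);

    /* Writable does not mean connected: fetch the real status. */
    int err = 0;
    socklen_t len = sizeof(err);
    getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len);
    if (err != 0) {                 /* e.g. ECONNREFUSED: close and retry */
        fprintf(stderr, "connect failed: %s\n", strerror(err));
        close(fd);
        return -1;
    }
    return fd;
}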
There are a few ways to test whether a non-blocking connect succeeded:
Call getpeername() first; if it fails with the error ENOTCONN, the connection failed. Then call getsockopt with SO_ERROR to get the pending error on the socket.
Call read with a length of 0. If the read fails, the connection failed, and the errno from read indicates why; read returns 0 if the connection succeeded.
Call connect again; if the errno is EISCONN, the socket is already connected and the first connect succeeded.
Ref: UNIX Network Programming V1
D. J. Bernstein gathered together various methods of checking whether an asynchronous connect() call succeeded or not. Many of these methods have drawbacks on certain systems, so writing portable code for this is unexpectedly hard. If anyone wants to read about all the possible methods and their drawbacks, check out this document.
For those who just want the tl;dr version, the most portable way is the following:
Once the system signals the socket as writable, first call getpeername() to see if it connected or not. If that call succeeds, the socket is connected and you can start using it. If it fails with ENOTCONN, the connection failed. To find out why, try to read one byte from the socket, read(fd, &ch, 1); it will fail as well, but the error you get is the error you would have gotten from connect() if it weren't non-blocking.
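A sketch of that portable check, to be run once the socket is reported writable (helper name mine):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if the non-blocking connect on fd succeeded, -1 if not. */
int check_connect(int fd)
{
    struct sockaddr_storage peer;
    socklen_t len = sizeof(peer);

    if (getpeername(fd, (struct sockaddr *)&peer, &len) == 0)
        return 0;                   /* connected, start using the socket */

    if (errno == ENOTCONN) {
        /* Connect failed: a 1-byte read fails with the error that
         * connect() would have reported in the blocking case. */
        char ch;
        if (read(fd, &ch, 1) == -1)
            fprintf(stderr, "connect failed: %s\n", strerror(errno));
    }
    return -1;
}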

Waiting for data via select not working

I'm currently working on a project which involves multiple clients connected to a server, waiting for data. I'm using select to monitor the connection for incoming data. However, the client just continues to print nothing, acting as if select has discovered incoming data. Perhaps I'm attacking this wrong?
The first piece of data the server sends is displayed correctly. However, the server then disconnects, and the client continues to spew blank lines.
FD_ZERO(&readnet);
FD_SET(sockfd, &readnet);
while (1) {
    rv = select(socketdescrip, &readnet, NULL, NULL, &timeout);
    if (rv == -1) {
        perror("select"); // error occurred in select()
    } else if (rv == 0) {
        printf("Connection timeout! No data after 10 seconds.\n");
    } else {
        // one or both of the descriptors have data
        if (FD_ISSET(sockfd, &readnet)) {
            numbytes = recv(sockfd, buf, sizeof buf, 0);
            printf("Data Received\n");
            buf[numbytes] = '\0';
            printf("client: received '%s'\n", buf);
            sleep(10);
        }
    }
}
I think you need to check the result of recv. If it returns zero, it means the server has closed the socket.
Also, select's first argument must be one more than the highest descriptor in the set, so you need to pass socketdescrip + 1 to select.
If I remember correctly, you need to initialise the set of fds before each call to select(), because select() modifies it in place.
So move FD_ZERO() and FD_SET() inside the loop, just before select().
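Putting these fixes together (set rebuilt each pass, sockfd + 1 as the first argument, the timeout re-initialised since Linux's select modifies it too, and the recv result checked), the question's loop might become this sketch; sockfd and buf are as declared in the question:

while (1) {
    FD_ZERO(&readnet);              // rebuild: select() modifies the set
    FD_SET(sockfd, &readnet);
    struct timeval timeout = { .tv_sec = 10, .tv_usec = 0 };

    int rv = select(sockfd + 1, &readnet, NULL, NULL, &timeout);
    if (rv == -1) {
        perror("select");
        break;
    } else if (rv == 0) {
        printf("Connection timeout! No data after 10 seconds.\n");
    } else if (FD_ISSET(sockfd, &readnet)) {
        int numbytes = recv(sockfd, buf, sizeof buf - 1, 0); // room for '\0'
        if (numbytes == 0) {        // peer closed the connection
            printf("server closed the connection\n");
            break;
        }
        if (numbytes < 0) {         // error: never index buf[-1]
            perror("recv");
            break;
        }
        buf[numbytes] = '\0';
        printf("client: received '%s'\n", buf);
    }
}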
acting as if select has discovered incoming data. Perhaps I'm attacking this wrong?
In addition to what was said before, I'd like to note that select()/poll() tell you not that "data are there" but rather that the next corresponding system call will not block. That's it. As was said above, recv() doesn't block and properly returns 0, which means EOF: the connection was closed by the other side.
Though on most *nix systems, in that case only the first call to recv() returns 0; the following calls return -1. When using async I/O, rigorous error checking is a must!
And personally, I would strongly suggest using poll() instead. Unlike select(), it doesn't destroy its arguments and works fine with high-numbered socket descriptors.
When the server closes the connection, it sends a packet with the FIN flag set to the client side to announce that it will no longer send data. The packet is processed by the TCP/IP stack on the client side and carries no data for the application level. The application level is notified, which triggers select, because something happened on the monitored file descriptor, and recv() returns 0 bytes because no data was sent by the server.
Is this true of your code?
select(highest_file_descriptor+1, &readnet, NULL, NULL, &timeout);
In your simple example (with FD_ZERO and FD_SET moved inside the while(1) loop as qrdl said) it should look like this:
select(sockfd+1, &readnet, NULL, NULL, &timeout);
Also, please note that when recv returns 0 bytes read, it means that the connection was closed: no more data! Your code is also buggy: when something bad happens in recv (it returns < 0 in that case), you will have serious trouble, because something like buf[-1] may lead to unpredictable results. Please handle this case properly.
While I respect the fact that you are trying to use the low-level BSD sockets API, I must say that I find it awfully inefficient. That's why I recommend, if possible, using ACE, which is a very efficient and productive framework with a lot of network programming already implemented (ACE_Reactor, for example, makes it easier to do what you're trying to achieve here).

Reusing socket descriptor on connection failure

In my client code, I am following these steps to connect to a socket:
1. Creating a socket:
   sockDesc = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP)
2. Connecting it (retrying 'x' times in case of failure), after filling the destAddr fields:
   connect(sockDesc, (sockaddr *) &destAddr, sizeof(destAddr))
3. Using the socket for send()/recv() operations:
   send(sockDesc, buffer, bufferLen, 0)
   recv(sockDesc, buffer, bufferLen, 0)
4. close()-ing the socket descriptor and exiting:
   close(sockDesc)
If the connection breaks during send()/recv(), I found that I could reconnect by returning to step 2.
Is this solution okay, or should I close the socket descriptor and return to step 1?
Another interesting observation that I am not able to understand: when I stop my echo server and start the client, I create a socket (step 1) and call connect(), which fails (as expected), but then I keep calling connect(), let's say, 10 times. After 5 retries I start the server and connect() succeeds, but the subsequent send() call gets a SIGPIPE. I would like to know:
1) Do I need to create a new socket every time connect() fails? As per my understanding, as long as I have not performed any send()/recv() on the socket, it is as good as new and I can reuse the same fd for the connect() call.
2) I don't understand why SIGPIPE is received when the server is up and connect() was successful.
Yes, you should close and go back to step 1:
close() closes a file descriptor, so that it no longer refers to any file and may be reused.
From here.
I think closing the socket is the right thing to do, despite the fact that it may work if you don't.
A socket which has failed to connect may not be in EXACTLY the same state as a brand new one - which could cause problems later. I'd rather avoid the possibility and just make a new one. It's cleaner.
TCP sockets hold a LOT of state, some of which is implementation-specific and worked out from the network.
A socket corresponding to a broken connection is in an unstable state; normally you will not be allowed to connect again until the operating system releases the socket.
I think it will be better to close() and connect again; you don't have to create another socket.
Anyway, make sure to set SO_LINGER on your socket to ensure that no data is lost in transmission.
See http://www.gnu.org/s/libc/manual/html_node/Socket_002dLevel-Options.html#Socket_002dLevel-Options
If the connection was broken and you try to write to the file descriptor, you get the broken-pipe error/signal. All this says is that the file descriptor you tried writing to no longer has anyone on the other side to read what you are sending.
What you can do is catch the SIGPIPE signal and then deal with reconnecting by closing the FD and going back to your step 1. You will then have a new FD you can read from and write to for the connection.
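A sketch of that recovery path (helper names mine; SIGPIPE is ignored beforehand so the failed send shows up as errno == EPIPE rather than a fatal signal):

#include <errno.h>
#include <netinet/in.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Once at startup: turn SIGPIPE into an EPIPE error from send(). */
static void setup(void) { signal(SIGPIPE, SIG_IGN); }

/* On EPIPE, go back to step 1: close the old descriptor and
 * create/connect a fresh socket. Returns the new fd, or -1. */
int reconnect(int sockDesc, const struct sockaddr *destAddr, socklen_t len)
{
    close(sockDesc);
    int fd = socket(PF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (fd == -1)
        return -1;
    if (connect(fd, destAddr, len) == -1) {
        close(fd);                  /* failed: retry with another fresh fd */
        return -1;
    }
    return fd;
}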
If the Single UNIX Specification doesn't say that it MUST work to go back to step #2 instead of step #1, then the fact that it happens to work on Linux is just an implementation detail, and you would be far better off and more portable if you go back to step #1. As far as I am aware, the specification does not make any guarantee that it is ok to go back to step #2 and, therefore, I would advise you to go back to step #1.
