Socket connection ends with Operation now in progress on non-blocking socket - c

I have a problem connecting to a destination IP using the connect() API. connect() returns -1 and sets errno to EINPROGRESS ("Operation now in progress"). Am I checking the return code too early, before the connection is established? Please see the following code snippet:
struct sockaddr_in servAddr;
servAddr.sin_family = AF_INET;
servAddr.sin_port = htons(9190);
const char * remoteIp = "10.10.20.86";
rc = inet_pton(AF_INET,remoteIp, &servAddr.sin_addr);
if (rc != 1) // inet_pton() returns 1 on success, 0 for an unparsable address, -1 (errno EAFNOSUPPORT) for an unsupported family
{
return 0;
}
rc = connect(fd, (sockaddr*)&servAddr, sizeof(servAddr));
if ( rc < 0) // this is where it fails. rc is -1.
{
log("connect failure with [%s]",strerror(errno));
print_sock_connect_error();
}
I have 2 questions here:
1. The destination IP and port 10.10.20.86:9190 is waiting for a connection, and once the connection is received, it sends the ACK back to the source. I see the TCP handshake (SYN, SYN/ACK, and ACK to the destination) in the pcap, but still can't figure out why connect() returns -1 with an error. So am I checking rc before the connection establishment is complete? sysctl net.ipv4.tcp_syn_retries is set to 6.
2. Is there anything wrong with the code above?

Am I checking the rc before the connection establishment is complete?
Yes, you are. The TCP ping-pong during the connection's set up isn't all that has to be done.
Is there anything wrong with the code above?
Well, yes: either the way it handles the EINPROGRESS case, or the fact that it uses a non-blocking socket to connect.
From connect()'s Linux documentation:
EINPROGRESS
The socket is nonblocking and the connection cannot be
completed immediately. It is possible to select(2) or poll(2)
for completion by selecting the socket for writing. After
select(2) indicates writability, use getsockopt(2) to read the
SO_ERROR option at level SOL_SOCKET to determine whether
connect() completed successfully (SO_ERROR is zero) or
unsuccessfully (SO_ERROR is one of the usual error codes
listed here, explaining the reason for the failure).

10.10.20.86:9190 is waiting for a connection, and once the connection is received, it sends the ACK back to the source. I see the TCP handshake (SYN, SYN/ACK, and ACK to the destination) in the pcap, but still can't figure out why connect() returns -1 with an error. So am I checking rc before the connection establishment is complete?
Of course you are. You're checking it immediately after connect() returns. As you have put the socket into non-blocking mode, there is no chance the three-way wire handshake will have completed by then.
sysctl net.ipv4.tcp_syn_retries is set to 6.
Irrelevant.
Is there anything wrong with the code above?
Only that it doesn't make sense.
If you want the connection to have completed or failed before connect() returns, don't use non-blocking mode.
If you want to use non-blocking mode, you have to use select() to tell you when the connect attempt has completed. Select for the socket becoming writeable. (That doesn't necessarily mean it has become writeable: it means the connect attempt has completed, with a result you can discover via getsockopt()/SO_ERROR.)
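The question does not show how the socket was put into non-blocking mode. Assuming it was done with fcntl() and O_NONBLOCK, a small helper along these lines (the name set_nonblocking is hypothetical) lets you pick either behaviour explicitly before calling connect():

```c
#include <fcntl.h>

/* Hypothetical helper: turn O_NONBLOCK on (enable != 0) or off for fd.
 * Returns 0 on success, -1 on error with errno set by fcntl(). */
int set_nonblocking(int fd, int enable)
{
    int flags = fcntl(fd, F_GETFL, 0);   /* read the current status flags */
    if (flags < 0)
        return -1;

    if (enable)
        flags |= O_NONBLOCK;
    else
        flags &= ~O_NONBLOCK;

    return fcntl(fd, F_SETFL, flags) < 0 ? -1 : 0;
}
```

With set_nonblocking(fd, 0) before connect(), the call blocks until the attempt succeeds or fails, so checking rc right away is meaningful.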

Related

Can socket send fail cause a daemon program crash?

I have two applications running on Embedded Linux board. One runs as a daemon and other acts as an interface for it. They communicate with each other using Unix sockets.
To test how abnormal termination of the socket is handled, I terminated the interface application with Ctrl+C. As a result, the daemon application crashes. Since the socket is terminated, I get the "socket send failed" error on the daemon side, which is expected, but after that the daemon crashes.
I am at a loss as to where exactly should I look for debugging this problem.
Have you set the socket in your daemon to non-blocking mode?
Suppose your code looks like the following:
while(1)
{
connfd = accept(listenfd, (struct sockaddr*)NULL, NULL);
/* then you use the fd */
func(connfd);
}
Based on the man page:
"
On success, accept() returns a nonnegative integer that is a descriptor for the accepted socket. On error, -1 is returned, and errno is set appropriately.
and
If no pending connections are present on the queue, and the socket is not marked as nonblocking, accept() blocks the caller until a connection is present. If the socket is marked nonblocking and no pending connections are present on the queue, accept() fails with the error EAGAIN or EWOULDBLOCK.
"
This means that in non-blocking mode you must check the return value of accept() before using it, because the fd may be -1.
The above is just one common possibility. If that is not the case, you can use "sudo strace -p process_id" or analyse the core file to understand why it crashed.

How to debug tcp socket connection in C?

I have a program in C.
I want to connect to a socket at address 0xAC101067, port 3333 (172.16.16.103:3333),
but the connection always fails and the call always returns -1:
connect(device_info->cloud_fd, &addr, sizeof(addr))
As I understand the API, 0 means success and -1 means failure.
How can I find out what the problem is in this program?
if (device_info->cloud_fd == -1 && (u32) cloud_ip_addr > 0) {
device_info->cloud_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
setsockopt(device_info->cloud_fd, 0, SO_BLOCKMODE, &opt, 4);
cloud_ip_addr = 0xAC101067;
addr.s_ip = cloud_ip_addr;
addr.s_port = 3333;//device_info->conf.server_port;
printf("device_info->cloud_fd=%d\r\n", device_info->cloud_fd);
if (connect(device_info->cloud_fd, &addr, sizeof(addr)) != 0)
goto cloud_error;
}
Fetch the appropriate errno as explained in the answer from @LPs. The problem may not be in the code but external.
Capture the traffic between the client and the server with tcpdump and see what happens on the wire. This capture is indispensable when debugging why clients fail to connect. Maybe the server is not reachable and connect() times out (ETIMEDOUT), or no one is listening on that port on the destination machine (ECONNREFUSED).
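As an illustration of fetching errno, here is a sketch using the standard BSD socket API. It is not the asker's embedded stack (SO_BLOCKMODE, s_ip and s_port are specific to that platform), and the helper name try_connect is made up for this example:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: try a blocking TCP connect to ip:port (ip given as a
 * dotted quad) and report why it failed.
 * Returns 0 on success, otherwise the errno value describing the failure. */
int try_connect(const char *ip, unsigned short port)
{
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);              /* port must be in network byte order */
    if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
        return EINVAL;                        /* not a valid IPv4 address string */

    int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
    if (fd < 0) {
        int sock_err = errno;                 /* save before stdio can change it */
        fprintf(stderr, "socket failed: %s\n", strerror(sock_err));
        return sock_err;
    }

    int err = 0;
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) != 0) {
        err = errno;                          /* save before any other call */
        fprintf(stderr, "connect to %s:%u failed: %s (errno %d)\n",
                ip, (unsigned)port, strerror(err), err);
    }
    close(fd);
    return err;
}
```

A result of ECONNREFUSED or ETIMEDOUT from something like try_connect("172.16.16.103", 3333) would point at the server or the network rather than at the client code.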
ERRNO DESCRIPTION
errno - number of last error
SYNOPSIS
#include <errno.h>
DESCRIPTION
The <errno.h> header file defines the integer variable errno, which
is set by system calls and some library functions in the event of an
error to indicate what went wrong. Its value is significant only
when the return value of the call indicated an error (i.e., -1 from
most system calls; -1 or NULL from most library functions); a
function that succeeds is allowed to change errno.
Valid error numbers are all nonzero; errno is never set to zero by
any system call or library function.
FOR YOUR FUNCTION
RETURN VALUE
If the connection or binding succeeds, zero is returned. On error,
-1 is returned, and errno is set appropriately.
ERRORS
The following are general socket errors only. There may be other
domain-specific error codes.
EACCES For UNIX domain sockets, which are identified by pathname:
Write permission is denied on the socket file, or search
permission is denied for one of the directories in the path
prefix. (See also path_resolution(7).)
EACCES, EPERM
The user tried to connect to a broadcast address without
having the socket broadcast flag enabled or the connection
request failed because of a local firewall rule.
EADDRINUSE
Local address is already in use.
EADDRNOTAVAIL
(Internet domain sockets) The socket referred to by sockfd had
not previously been bound to an address and, upon attempting
to bind it to an ephemeral port, it was determined that all
port numbers in the ephemeral port range are currently in use.
See the discussion of /proc/sys/net/ipv4/ip_local_port_range
in ip(7).
EAFNOSUPPORT
The passed address didn't have the correct address family in
its sa_family field.
EAGAIN Insufficient entries in the routing cache.
EALREADY
The socket is nonblocking and a previous connection attempt
has not yet been completed.
EBADF The file descriptor is not a valid index in the descriptor
table.
ECONNREFUSED
No-one listening on the remote address.
EFAULT The socket structure address is outside the user's address
space.
EINPROGRESS
The socket is nonblocking and the connection cannot be
completed immediately. It is possible to select(2) or poll(2)
for completion by selecting the socket for writing. After
select(2) indicates writability, use getsockopt(2) to read the
SO_ERROR option at level SOL_SOCKET to determine whether
connect() completed successfully (SO_ERROR is zero) or
unsuccessfully (SO_ERROR is one of the usual error codes
listed here, explaining the reason for the failure).
EINTR The system call was interrupted by a signal that was caught;
see signal(7).
EISCONN
The socket is already connected.
ENETUNREACH
Network is unreachable.
ENOTSOCK
The file descriptor is not associated with a socket.
EPROTOTYPE
The socket type does not support the requested communications
protocol. This error can occur, for example, on an attempt to
connect a UNIX domain datagram socket to a stream socket.
ETIMEDOUT
Timeout while attempting connection. The server may be too
busy to accept new connections. Note that for IP sockets the
timeout may be very long when syncookies are enabled on the
server.

Linux, sockets, non-blocking connect

I want to create a non-blocking connect.
Like this:
socket.connect(); // returns immediately
For this, I use another thread, an infinite loop, and Linux epoll, like this (pseudocode):
// in another thread
{
create_non_block_socket();
connect();
epoll_create();
epoll_ctl(); // subscribe socket to all events
while (true)
{
epoll_wait(); // wait a small time(~100 ms)
check_socket(); // check on EPOLLOUT event
}
}
If I run the server first and then the client, it all works. If I run the client first, wait a short time, and then run the server, the client doesn't connect.
What am I doing wrong? Maybe it can be done differently?
You should use the following steps for an async connect:
1. Create the socket with socket(..., SOCK_NONBLOCK, ...).
2. Start the connection with connect(fd, ...).
3. If connect() returns neither 0 nor -1 with errno set to EINPROGRESS, abort with an error.
4. Wait until fd is signalled as ready for output.
5. Check the status of the socket with getsockopt(fd, SOL_SOCKET, SO_ERROR, ...).
6. Done.
No loops - unless you want to handle EINTR.
If the client is started first, you should see the error ECONNREFUSED in the last step. If this happens, close the socket and start from the beginning.
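The steps above could be sketched like this, using poll() for the wait (epoll would work the same way with EPOLLOUT). The function name async_connect and the timeout parameter are assumptions made for this example, and SOCK_NONBLOCK makes it Linux-specific:

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper following the steps above.
 * Returns a connected, non-blocking descriptor, or -1 with errno set
 * (for example ECONNREFUSED when no server is listening yet). */
int async_connect(const struct sockaddr *addr, socklen_t addrlen, int timeout_ms)
{
    /* Step 1: create the socket already in non-blocking mode. */
    int fd = socket(addr->sa_family, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0)
        return -1;

    int err = 0;
    /* Steps 2 and 3: start the attempt; only EINPROGRESS means "keep waiting". */
    if (connect(fd, addr, addrlen) != 0) {
        if (errno != EINPROGRESS) {
            err = errno;
        } else {
            /* Step 4: wait until the socket is reported ready for output. */
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int rc;
            do {
                rc = poll(&pfd, 1, timeout_ms);
            } while (rc < 0 && errno == EINTR);

            if (rc < 0) {
                err = errno;
            } else if (rc == 0) {
                err = ETIMEDOUT;
            } else {
                /* Step 5: SO_ERROR holds the real outcome of the attempt. */
                socklen_t len = sizeof(err);
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                    err = errno;
            }
        }
    }

    if (err != 0) {
        close(fd);        /* a failed socket is closed; retry with a new one */
        errno = err;
        return -1;
    }
    return fd;
}
```

If the client is started before the server, a caller can retry by calling the function again after a delay, since each call starts from a fresh socket.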
It is difficult to tell what's wrong with your code without seeing more details. I suppose that you do not abort on errors in your check_socket operation.
There are a few ways to test whether a non-blocking connect succeeded:
1. Call getpeername() first. If it fails with ENOTCONN, the connection failed; then call getsockopt() with SO_ERROR to get the pending error on the socket.
2. Call read() with a length of 0. If the read fails, the connection failed, and the errno from read() indicates why; read() returns 0 if the connection succeeded.
3. Call connect() again. If errno is EISCONN, the socket is already connected and the first connect() succeeded.
Ref: UNIX Network Programming V1
D. J. Bernstein gathered together various methods for checking whether an asynchronous connect() call succeeded. Many of these methods have drawbacks on certain systems, so writing portable code for this is unexpectedly hard. If anyone wants to read about all the possible methods and their drawbacks, check out this document.
For those who just want the tl;dr version, the most portable way is the following:
Once the system signals the socket as writable, first call getpeername() to see if it connected or not. If that call succeeded, the socket connected and you can start using it. If that call fails with ENOTCONN, the connection failed. To find out why it failed, try to read one byte from the socket read(fd, &ch, 1), which will fail as well but the error you get is the error you would have gotten from connect() if it wasn't non-blocking.
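A minimal sketch of that check, to be called once the socket has been reported writable. The helper name connect_result is hypothetical, and the ENOTCONN fallback at the end is a choice made for this example:

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: call after select()/poll() reports fd as writable.
 * Returns 0 if the socket is connected, otherwise an errno value that
 * explains why the connection attempt failed. */
int connect_result(int fd)
{
    struct sockaddr_storage peer;
    socklen_t len = sizeof(peer);

    if (getpeername(fd, (struct sockaddr *)&peer, &len) == 0)
        return 0;                 /* a peer address exists: connected */

    if (errno != ENOTCONN)
        return errno;             /* e.g. EBADF: not a connect failure as such */

    /* Not connected: a one-byte read fails with the error that a blocking
     * connect() would have reported. */
    char ch;
    if (read(fd, &ch, 1) < 0)
        return errno;

    return ENOTCONN;              /* unexpected: read did not fail */
}
```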

Why doesn't client's close() of socket cause server's select() to return

[I asked something similar before. This is a more focused version.]
What can cause a server's select() call on a TCP socket to consistently time-out rather than "see" the client's close() of the socket? On the client's side, the socket is a regular socket()-created blocking socket that successfully connects to the server and successfully transmits a round-trip transaction. On the server's side, the socket is created via an accept() call, is blocking, is passed to a child server process via fork(), is closed by the top-level server, and is successfully used by the child server process in the initial transaction. When the client subsequently closes the socket, the select() call of the child server process consistently times-out (after 1 minute) rather than indicating a read-ready condition on the socket. The select() call looks for read-ready conditions only: the write-ready and exception arguments are NULL.
Here's the simplified but logically equivalent select()-using code in the child server process:
int one_svc_run(
const int sock,
const unsigned timeout)
{
struct timeval timeo;
fd_set fds;
timeo.tv_sec = timeout;
timeo.tv_usec = 0;
FD_ZERO(&fds);
FD_SET(sock, &fds);
for (;;) {
fd_set readFds = fds;
int status = select(sock+1, &readFds, 0, 0, &timeo);
if (status < 0)
return errno;
if (status == 0)
return ETIMEDOUT;
/* This code not reached when client closes socket */
/* The time-out structure, "timeo", is appropriately reset here */
...
}
...
}
Here's the logical equivalent of the sequence of events on the client-side (error-handling not shown):
struct sockaddr_in *raddr = ...;
int sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
(void)bindresvport(sock, (struct sockaddr_in *)0);
connect(sock, (struct sockaddr *)raddr, sizeof(*raddr));
/* Send a message to the server and receive a reply */
(void)close(sock);
fork(), exec(), and system() are never called. The code is considerably more complex than this, but this is the sequence of relevant calls.
Could Nagle's algorithm cause the FIN packet to not be sent upon close()?
Most likely explanation is that you're not actually closing the client end of the connection when you think you are. Probably because you have some other file descriptor that references the client socket somewhere that is not being closed.
If your client program ever does a fork (or related calls that fork, such as system or popen), the forked child might well have a copy of the file descriptor which would cause the behavior you're seeing.
One way to test/workaround the problem is to have the client do an explicit shutdown(2) prior to closing the socket:
shutdown(sock, SHUT_RDWR);
close(sock);
If this causes the problem to go away then that is the problem -- you have another copy of the client socket file descriptor somewhere hanging around.
If the problem is due to children getting the socket, the best fix is probably to set the close-on-exec flag on the socket immediately after creating it:
fcntl(sock, F_SETFD, fcntl(sock, F_GETFD) | FD_CLOEXEC);
or on some systems, use the SOCK_CLOEXEC flag to the socket creation call.
Mystery solved.
@nos was correct in the first comment: it's a firewall problem. A shutdown() by the client isn't needed; the client does close the socket; the server does use the right timeout; and there's no bug in the code.
The problem was caused by the firewall rules on our Linux Virtual Server (LVS). A client connects to the LVS and the connection is passed to the least-loaded of several backend servers. All packets from the client pass through the LVS; all packets from the backend server go directly to the client. The firewall rules on the LVS caused the FIN packet from the client to be discarded. Thus, the backend server never saw the close() by the client.
The solution was to remove the "-m state --state NEW" options from the iptables(8) rule on the LVS system. This allows the FIN packets from the client to be forwarded to the backend server. This article has more information.
Thanks to all of you who suggested using wireshark(1).
On Linux, the select() call modifies the value of its timeout argument. From the man page:
On Linux, select() modifies timeout to reflect the amount of time not
slept
So your timeo will run down to zero, and once it is zero, select() will return immediately (usually with a return value of zero).
The following change may help:
for (;;) {
struct timeval timo = timeo;
fd_set readFds = fds;
int status = select(sock+1, &readFds, 0, 0, &timo);

select says socket is ready to read when it is definitely not (actually is closed already)

In my server I check if any socket is ready to read using select() to determine it. As a result in main loop select() is executed every time it iterates.
To test the server I wrote a simple client that sends only one message and then quits. By the way, I use protocol buffers to send information: "message" here means an object of the library's Message class.
The test session looks like:
1. select()
2. server's socket ready to read
3. accept() client's socket
4. read message from client's socket
5. select()
6. server's socket not ready to read, client's one ready
7. read message from client's socket
The last step is wrong because client has already closed connection. As a result protobuf library gets Segmentation fault. I wonder why FD_ISSET says the socket is ready in step 6 when it is closed. How can I check if a socket is closed?
EDIT:
I've found how to check if the socket is open
int error = 0;
socklen_t len = sizeof (error);
int retval = getsockopt (socket_fd, SOL_SOCKET, SO_ERROR, &error, &len );
The socket is "readable" if the remote peer closes it. You need to call recv() and handle both the case where it returns an error and the case where it returns 0, which indicates that the peer shut down the connection in an orderly fashion.
Reading the SO_ERROR socket option is not the correct way, as it returns the current pending error (e.g. from a non-blocking connect).
The socket used for communication between a client and your server will be flagged as readable (i.e. select() will return) when there is data to read, or when there's an EOF to read (i.e. the peer closed the connection).
Just read() when select() returns and your fd is flagged. If read() returns a positive number, you got data. If it returns 0, you got EOF. If it returns -1, you have a problem (unless errno is EAGAIN).
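The three cases can be told apart with a small wrapper like the one below. This is only a sketch: the name read_ready and the -2 return code for "try again" are conventions invented for this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call after select() has flagged fd as readable.
 * Returns > 0 for the number of bytes read, 0 on EOF (the peer closed the
 * connection), -2 if there was nothing to read after all (EAGAIN or EINTR),
 * or -1 on a real error with errno set. */
ssize_t read_ready(int fd, void *buf, size_t cap)
{
    ssize_t n = read(fd, buf, cap);
    if (n >= 0)
        return n;                 /* data, or 0 meaning EOF */

    if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR)
        return -2;                /* not an error: go back to select() */

    return -1;
}
```

In the server from the question, a return value of 0 is the signal to close the client's descriptor and remove it from the fd_set, rather than handing an empty buffer to the protobuf parser.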

Resources