I am trying to build a system that opens parallel TCP sockets using threads.
My threads are triggered by message queue IPC, so every time a packet arrives in the message queue a thread "wakes up", opens a TCP connection to the remote server and sends the packet.
My problem is that in Wireshark I can see that the time it takes to send a file is smaller using threads instead of one connection, but the throughput does not change.
My questions are:
1. How can I verify that my threads are working in parallel?
2. How can I improve this code?
3. How can I open several sockets using one thread?
I am using a virtual machine to run the multithreaded clients.
The IDE I am using is CLion, and the language is C.
My code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <unistd.h> // for close
#include <pthread.h>
#include <math.h>
#include <malloc.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <linux/if_packet.h>
#include <netinet/in.h>
#include <netinet/if_ether.h> // for ethernet header
#include <netinet/ip.h> // for ip header
#include <netinet/udp.h> // for udp header
#include <netinet/tcp.h>
#include <byteswap.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <mqueue.h>
#include <assert.h>
#include <time.h>
#define QUEUE_NAME "/ServerDan_Queue"
#define QUEUE_PERM 0660
#define MAX_MESSAGES 10 // queue holds at most 10 messages
#define MAX_MSG_SIZE 4105 // packet_number + seq + data[4096] + lastfrag
#define MSG_BUFFER_SIZE (MAX_MSG_SIZE + 10)
#define BSIZE 1024
#define Nbytes 4096
#define ElorServer_addr "192.168.1.54"
///params:
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
struct sockaddr_in server;
struct stat obj;
int sock;
int k, size, status;
int i = 0;
typedef struct frag
{
    int packet_number;
    int seq;
    uint8_t data[4096];
    bool lastfrag;
} fragma;
void *middlemanThread(void *arg)
{
    ///=========================================///
    ///**** Waiting for message queue trigger:  ///
    ///=========================================///
    long id = (long)arg;
    id += 1;
    mqd_t qd; // queue descriptor
    // open the queue for reading
    qd = mq_open(QUEUE_NAME, O_RDONLY);
    assert(qd != (mqd_t)-1);
    struct mq_attr attr;
    assert(mq_getattr(qd, &attr) != -1);
    uint8_t *income_buf = calloc(attr.mq_msgsize, 1);
    uint8_t *cast_buf; // walking pointer into income_buf
    assert(income_buf);
    fragma frag;
    struct timespec timeout;
    clock_gettime(CLOCK_REALTIME, &timeout);
    timeout.tv_sec += 50;
    printf("Waiting for messages ..... \n\n");
    while (1) {
        ///=====================================///
        ///**** Receive from the message queue: ///
        ///=====================================///
        if (mq_timedreceive(qd, (char *)income_buf, attr.mq_msgsize, NULL, &timeout) < 0) {
            printf("Failed to receive message for 50 sec \n");
            pthread_exit(NULL);
        } else {
            cast_buf = income_buf;
            printf("Received successfully , your msg :\n");
            frag.packet_number = *(int *)cast_buf; // first int of the message
            cast_buf += sizeof(int);
            frag.seq = *(int *)cast_buf; // second int
            cast_buf += sizeof(int);
            memcpy(frag.data, cast_buf, Nbytes); // payload
            cast_buf += Nbytes;
            frag.lastfrag = *cast_buf; // trailing flag byte
        }
        pthread_mutex_lock(&lock);
        ///===============================================///
        ///**** Connect to the server, send the fragment: ///
        ///===============================================///
        int size = 2 * sizeof(int) + Nbytes + sizeof(bool); // serialized fragment size
        printf("In thread\n");
        int clientSocket;
        struct sockaddr_in serverAddr;
        socklen_t addr_size;
        // Create the socket.
        clientSocket = socket(PF_INET, SOCK_STREAM, 0);
        // Configure settings of the server address.
        serverAddr.sin_family = AF_INET; // address family is Internet
        serverAddr.sin_port = htons(8081); // set port number, using htons
        serverAddr.sin_addr.s_addr = inet_addr("192.168.14.149"); // server IP
        memset(serverAddr.sin_zero, '\0', sizeof serverAddr.sin_zero);
        // Connect the socket to the server using the address.
        addr_size = sizeof serverAddr;
        if (connect(clientSocket, (struct sockaddr *)&serverAddr, addr_size) < 0)
            printf("Connect failed\n");
        if (send(clientSocket, income_buf, size, 0) < 0)
            printf("Send failed\n");
        printf("Thread Id : %ld \n", id);
        printf("Packet number : %d \n Seq = %d \n lastfrag = %d\n\n",
               frag.packet_number, frag.seq, (int)frag.lastfrag);
        pthread_mutex_unlock(&lock);
        close(clientSocket);
        usleep(20000);
    }
}
int main()
{
    int i = 0;
    pthread_t tid[5];
    while (i < 5) {
        if (pthread_create(&tid[i], NULL, middlemanThread, (void *)(long)i) != 0)
            printf("Failed to create thread\n");
        i++;
    }
    sleep(2);
    i = 0;
    while (i < 5) {
        pthread_join(tid[i], NULL);
        printf("Thread %d joined\n", i);
        i++;
    }
    return 0;
}
thus every time a packet arrives in the message queue a thread "wakes up", opens a TCP connection to the remote server and sends the packet
If you're at all concerned about speed or efficiency, don't do this. The single most expensive thing you can do with a TCP socket is the initial connection. You're doing a 3-way handshake just to send a single message!
Then, you're holding a global mutex while doing this entire operation - which, again, is the slowest operation in your program.
The current design is effectively single-threaded, but in the most complicated and expensive possible way.
I can see that the time it takes to send a file is smaller using threads instead of one connection, but the throughput does not change
I have no idea what you're actually measuring, and it's not at all clear that you do either. What is a file? One fragment? Multiple fragments? How big is it compared to your MTU? Have you checked that the fragments are actually received in the correct order? It looks to me like the only possible parallelism is exactly the spot where that could break.
How is it possible to have lower latency and unchanged throughput for a single file?
How can I verify that my threads are working in parallel?
If you see multiple TCP connections in Wireshark with different source ports, and their packets are interleaved, you have effective parallelism. This is unlikely, though, as you explicitly prohibited it with your global mutex!
What is the best way to check the throughput in Wireshark?
Don't. Use Wireshark for inspecting packets; use the server to determine throughput. That's where the results actually matter.
Is the concept of parallel TCP supposed to increase the throughput?
Why did you implement all this complexity if you don't know what it's for?
There's a good chance a single thread (correctly coded, with no spurious mutex thrashing) can saturate your network, so: no. Having multiple I/O threads is generally about conveniently partitioning your logic and state (i.e., having one client per thread, or different unrelated I/O subsystems in different threads), rather than about performance.
If you want to pull packets off a message queue and send them over TCP, the performant way is to:
use a single thread just doing this (your program may have other threads doing other things; avoid synchronizing with them if possible)
open a single persistent TCP connection to the server, and don't connect/close for every fragment
That's it. It's much simpler than what you have and will perform much better (see the sketch below).
You can realistically have one thread handling multiple different connections, but I can't see any way this would be useful in your case, so keep it simple.
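A minimal sketch of that approach, reusing QUEUE_NAME, the server address, and the message layout from your code (error handling abbreviated; an outline, not a drop-in replacement):
#include <assert.h>
#include <arpa/inet.h>
#include <mqueue.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* One thread, one persistent connection: connect once, then forward
 * every queue message as-is. No mutex needed - nothing is shared. */
void *senderThread(void *arg)
{
    (void)arg;
    mqd_t qd = mq_open(QUEUE_NAME, O_RDONLY);
    assert(qd != (mqd_t)-1);
    struct mq_attr attr;
    assert(mq_getattr(qd, &attr) != -1);
    char *buf = malloc(attr.mq_msgsize);
    assert(buf);

    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8081);
    srv.sin_addr.s_addr = inet_addr("192.168.14.149");
    if (connect(s, (struct sockaddr *)&srv, sizeof srv) < 0) {
        perror("connect"); /* connect once, up front */
        return NULL;
    }

    for (;;) {
        ssize_t n = mq_receive(qd, buf, attr.mq_msgsize, NULL);
        if (n < 0)
            break; /* queue error or shutdown */
        if (send(s, buf, (size_t)n, 0) < 0) {
            perror("send");
            break;
        }
    }
    close(s);
    free(buf);
    return NULL;
}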
Here is a partial answer to:
3.Is the concept of parallel TCP suppose to increase the throughput?
Kinda. It really depends on what the bottleneck is.
The first possible bottleneck is congestion control. A TCP sender has a limit on how many packets can be in flight at once (before an ACK for the first of the bunch is received), called the congestion window. This number starts small and grows over time. Also, if a packet is lost, this number is halved and then grows slowly back until the next drop occurs. (This is a summary; for details you need to know how congestion control works, and it is a big topic.) The limit is per TCP connection, however, so if you spread your data over several parallel connections, the overall congestion window (the sum of the windows of all flows) will grow faster and drop by a smaller amount. This happens irrespective of whether you are using threads or not. You can open several connections in one thread and achieve the same effect, as sketched below.
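To illustrate that last point, a sketch of striping a buffer across several connections opened by a single thread (a hypothetical helper; the address, port and chunk size are placeholders, and the receiver would need sequence information to reassemble the order):
#include <arpa/inet.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define NCONN 4
#define CHUNK 4096

/* Stripe `len` bytes across NCONN parallel connections, round-robin.
 * Each flow has its own congestion window, so the aggregate window
 * grows faster and a loss only halves one flow's share. */
static void send_striped(const char *ip, int port, const char *data, size_t len)
{
    int conns[NCONN];
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port = htons(port);
    srv.sin_addr.s_addr = inet_addr(ip);

    for (int i = 0; i < NCONN; i++) {
        conns[i] = socket(AF_INET, SOCK_STREAM, 0);
        if (conns[i] < 0 ||
            connect(conns[i], (struct sockaddr *)&srv, sizeof srv) < 0) {
            perror("connect");
            exit(1);
        }
    }
    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = len - off < CHUNK ? len - off : CHUNK;
        int c = (off / CHUNK) % NCONN; /* round-robin across flows */
        if (send(conns[c], data + off, n, 0) < 0)
            perror("send");
    }
    for (int i = 0; i < NCONN; i++)
        close(conns[i]);
}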
The second possible bottleneck is network processing in the OS. AFAIK this becomes an issue starting with 10Gb links; maybe at 1Gb, but probably not. TCP processing happens in the OS, not in your application. You may achieve better performance if the OS spreads the processing between processors (there should be parameters to enable this), and maybe a little better performance because of caches.
If you are reading files from disk, your disk I/O can also very well be the bottleneck. In that case I don't think spreading the sending across different threads is actually going to help.
Related
When a process runs out of file descriptors, accept() will fail and set errno to EMFILE.
However, the underlying connection that would have been accepted is not closed, so there appears to be no way to inform the client that the application could not handle the connection.
The question is what is the proper action to take regarding accepting TCP connections when running out of file descriptors.
The following code demonstrates the issue that I want to learn how best to deal with (note this is just example code demonstrating the issue/question, not production code):
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
static void err(const char *str)
{
    perror(str);
    exit(1);
}
int main(int argc, char *argv[])
{
    int serversocket;
    struct sockaddr_in serv_addr;
    serversocket = socket(AF_INET, SOCK_STREAM, 0);
    if (serversocket < 0)
        err("socket()");
    memset(&serv_addr, 0, sizeof serv_addr);
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = INADDR_ANY;
    serv_addr.sin_port = htons(6543);
    if (bind(serversocket, (struct sockaddr *)&serv_addr, sizeof serv_addr) < 0)
        err("bind()");
    if (listen(serversocket, 10) < 0)
        err("listen()");
    for (;;) {
        struct sockaddr_storage client_addr;
        socklen_t client_len = sizeof client_addr;
        int clientfd;
        clientfd = accept(serversocket, (struct sockaddr *)&client_addr, &client_len);
        if (clientfd < 0) {
            continue;
        }
        /* clientfd is deliberately never closed, to exhaust the fd limit */
    }
    return 0;
}
Compile and run this code with a limited number of file descriptors available:
gcc srv.c
ulimit -n 10
strace -t ./a.out 2>&1 |less
And in another console, I run
telnet localhost 6543 &
As many times as needed until accept() fails:
The output from strace shows this to happen:
13:21:12 socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
13:21:12 bind(3, {sa_family=AF_INET, sin_port=htons(6543), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
13:21:12 listen(3, 10) = 0
13:21:12 accept(3, {sa_family=AF_INET, sin_port=htons(43630), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 4
13:21:19 accept(3, {sa_family=AF_INET, sin_port=htons(43634), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 5
13:21:22 accept(3, {sa_family=AF_INET, sin_port=htons(43638), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
13:21:23 accept(3, {sa_family=AF_INET, sin_port=htons(43642), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 7
13:21:24 accept(3, {sa_family=AF_INET, sin_port=htons(43646), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 8
13:21:26 accept(3, {sa_family=AF_INET, sin_port=htons(43650), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 9
13:21:27 accept(3, 0xbfe718f4, [128]) = -1 EMFILE (Too many open files)
13:21:27 accept(3, 0xbfe718f4, [128]) = -1 EMFILE (Too many open files)
13:21:27 accept(3, 0xbfe718f4, [128]) = -1 EMFILE (Too many open files)
13:21:27 accept(3, 0xbfe718f4, [128]) = -1 EMFILE (Too many open files)
... and thousands upon thousands of more accept() failures.
Basically at this point:
the code will call accept() as fast as possible, failing to accept the same TCP connection over and over again, churning CPU.
the client will stay connected (as the TCP handshake completes before the application accepts the connection), and the client gets no information that there is an issue.
So,
Is there a way to force the TCP connection that caused accept() to fail to be closed (so that e.g. the client can be quickly informed and can perhaps try another server)?
What is the best practice for preventing the server code from going into an infinite loop when this situation arises (or for preventing the situation altogether)?
You can set aside an extra fd at the beginning of your program and keep track of the EMFILE condition:
int reserve_fd;
_Bool out_of_fd = 0;

if (0 > (reserve_fd = dup(1)))
    err("dup()");
Then, if you hit the EMFILE condition, you can close the reserve_fd and use its slot to accept the new connection (which you'll then immediately close):
clientfd = accept(serversocket, (struct sockaddr *)&client_addr, &client_len);
if (out_of_fd) {
    close(clientfd);
    if (0 > (reserve_fd = dup(1)))
        err("dup()");
    out_of_fd = 0;
    continue; /* doing other stuff that'll hopefully free the fd */
}
if (clientfd < 0) {
    close(reserve_fd);
    out_of_fd = 1;
    continue;
}
Complete example:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
static void err(const char *str)
{
    perror(str);
    exit(1);
}
int main(int argc, char *argv[])
{
    int serversocket;
    struct sockaddr_in serv_addr;
    serversocket = socket(AF_INET, SOCK_STREAM, 0);
    if (serversocket < 0)
        err("socket()");
    int yes = 1; /* must be initialized before setsockopt */
    if (-1 == setsockopt(serversocket, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof(int)))
        perror("setsockopt");
    memset(&serv_addr, 0, sizeof serv_addr);
    serv_addr.sin_family = AF_INET;
    serv_addr.sin_addr.s_addr = INADDR_ANY;
    serv_addr.sin_port = htons(6543);
    if (bind(serversocket, (struct sockaddr *)&serv_addr, sizeof serv_addr) < 0)
        err("bind()");
    if (listen(serversocket, 10) < 0)
        err("listen()");
    int reserve_fd;
    int out_of_fd = 0;
    if (0 > (reserve_fd = dup(1)))
        err("dup()");
    for (;;) {
        struct sockaddr_storage client_addr;
        socklen_t client_len = sizeof client_addr;
        int clientfd;
        clientfd = accept(serversocket, (struct sockaddr *)&client_addr, &client_len);
        if (out_of_fd) {
            close(clientfd);
            if (0 > (reserve_fd = dup(1)))
                err("dup()");
            out_of_fd = 0;
            continue; /* doing other stuff that'll hopefully free the fd */
        }
        if (clientfd < 0) {
            close(reserve_fd);
            out_of_fd = 1;
            continue;
        }
    }
    return 0;
}
If you're multithreaded, then I imagine you'd need a lock around fd-producing functions, taken whenever you close the extra fd (while expecting to accept the final connection), in order to prevent the spare slot from being filled by another thread.
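Sketched against the code above, with a hypothetical lock that every fd-producing call in the program would also have to take:
#include <pthread.h>

/* Serialize the close-accept-dup sequence so no other thread grabs
 * the descriptor slot freed by closing the spare. */
static pthread_mutex_t fd_lock = PTHREAD_MUTEX_INITIALIZER;

pthread_mutex_lock(&fd_lock);
close(reserve_fd);                     /* free one fd slot */
int clientfd = accept(serversocket, NULL, NULL);
if (clientfd >= 0)
    close(clientfd);                   /* reject the connection */
if (0 > (reserve_fd = dup(1)))         /* re-create the spare */
    err("dup()");
pthread_mutex_unlock(&fd_lock);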
All this only makes sense if 1) the listening socket isn't shared with other processes (which might not have hit their EMFILE limit yet) and 2) the server deals with persistent connections (because if it doesn't, you're bound to close some existing connection very soon anyway, freeing up an fd slot for your next attempt at accept).
Problem
You cannot accept client connections if the maximum number of file descriptors is reached. This can be a per-process limit (errno EMFILE) or a global system limit (errno ENFILE). The client does not notice this situation immediately; to the client it looks as if the connection was accepted by the server. If too many such connections pile up on the socket (when the backlog runs full), the server will stop sending syn-ack packets and the connection request will eventually time out at the client (which can be quite an annoying delay).
Number of file descriptors
It is of course possible to extend both limits when they are hit. For the per-process limit, use setrlimit(RLIMIT_NOFILE, ...); for the system-wide limit, sysctl() is the call to make. Both may require root privileges, the first one only to raise the hard limit.
However, there usually is a good reason for the file descriptor limit to prevent overusage of system resources, so this will not be a solution for all situations.
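For the per-process case, a minimal sketch that raises the soft limit to the current hard limit (no extra privileges needed for that much):
#include <sys/resource.h>

/* Raise the soft RLIMIT_NOFILE up to the hard limit. Raising the
 * hard limit itself would require root. */
static int raise_nofile_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    rl.rlim_cur = rl.rlim_max; /* soft = hard */
    return setrlimit(RLIMIT_NOFILE, &rl);
}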
Recovering from EMFILE
One option is to sleep after EMFILE is received; one second should be enough to prevent additional system load from calling accept() too often. This may be useful for handling short bursts of connections.
However, if the situation doesn't normalize soon, other measures should be taken (for example, if sleep() had to be called 5 times in a row or similar).
In this case it is advisable to close the server socket. All pending client connections will be terminated immediately (by receiving an RST packet) and the clients can use another server if applicable. Furthermore, no new client connections are accepted but immediately rejected (connection refused), instead of timing out as may happen while the socket is held open.
After the contention eases, the server socket can be opened again. For the EMFILE case it is only necessary to track the number of open client connections and re-open the server socket when it falls below some threshold. For the system-wide case there is no general answer; maybe just retry after some time, or use the /proc filesystem or system tools like lsof to find out when the contention ceases.
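Put together, the accept loop might handle this as follows (a sketch; the retry count and the sleep duration are arbitrary placeholders):
#include <errno.h>
#include <unistd.h>

/* Back off on EMFILE/ENFILE; close the listener if it persists. */
int failures_in_a_row = 0;
for (;;) {
    int clientfd = accept(serversocket, NULL, NULL);
    if (clientfd >= 0) {
        failures_in_a_row = 0;
        /* ... hand the connection off ... */
        continue;
    }
    if (errno == EMFILE || errno == ENFILE) {
        if (++failures_in_a_row >= 5) {
            /* Persistent contention: stop accepting, so pending and
             * new clients get a reset/refusal instead of a timeout. */
            close(serversocket);
            serversocket = -1;
            break; /* re-open later, when open fds drop again */
        }
        sleep(1); /* short burst: just throttle the loop */
    }
}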
One solution I've read about is to keep a "spare" file descriptor handy that you can use to accept and immediately close new connections when you're over fd capacity. For example:
int sparefd = open("/dev/null", O_RDONLY);
Then, when accept returns with EMFILE, you can:
close(sparefd); // create an available file descriptor
int newfd = accept(...); // accept a new connection
close(newfd); // immediately close the connection
sparefd = open("/dev/null", O_RDONLY); // re-create spare
It's not exactly elegant, but it's probably a little better than closing the listening socket in some circumstances. Be wary that if your program is multi-threaded then another thread might "claim" the spare fd as soon as you release it; there's no easy way to solve that (the "hard" way is to put a mutex around every operation that might consume a file descriptor).
I have a client-server program written in C. The intent is to see how fast big data can be transported over TCP. The receiving side OS (Ubuntu Linux 14.*) is tuned to improve TCP performance, per the documentation around TCP socket buffers and window scaling, as below:
net.ipv4.tcp_window_scaling = 1
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 16384 16777216
Apart from this, I have also increased the individual socket buffer size through a setsockopt call.
But I am not seeing the program respond to these changes; the overall throughput is either flat or even reduced at times. When I took a tcpdump at the receiving side, I see a monotonic pattern of TCP packets of length 1368 coming in, in most (99%) cases.
19:26:06.531968 IP <SRC> > <DEST>: Flags [.], seq 25993:27361, ack 63, win 57, options [nop,nop,TS val 196975830 ecr 488095483], length 1368
As per the documentation, the TCP window scaling option increases the receive window size in proportion to demand and capacity, but all I see is "win 57": very few bytes remaining in the receive buffer, which does not match my expectation.
Hence I start suspecting my assumptions on the tuning itself, and have these questions:
Are there any specific tunables required at the sending side to improve reception at the client side? Is making sure that the program writes the whole chunk of data in one go not enough?
Are the client-side tunables mentioned above necessary and sufficient? The defaults in the system are too low, but I don't see the changes applied in /etc/sysctl.conf having any effect. Is running sysctl --system after the changes sufficient to make them take effect, or do we need to reboot the system?
If the OS is a virtual machine, will these tunables be meaningful in their completeness, or are there additional steps required on the real physical machine?
I can share the source code if that helps, but I can guarantee that it is just trivial code.
Here is the code:
#cat client.c
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <string.h>
#define size (1024 * 1024 * 32)
int main() {
    int s;
    static char buffer[size]; /* static: 32 MB would overflow the stack */
    struct sockaddr_in sa;
    socklen_t addr_size;
    s = socket(PF_INET, SOCK_STREAM, 0);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(25000);
    sa.sin_addr.s_addr = inet_addr("<SERVERIP>");
    memset(sa.sin_zero, '\0', sizeof sa.sin_zero);
    addr_size = sizeof sa;
    connect(s, (struct sockaddr *) &sa, addr_size);
    int rbl = 1048576;
    int g = setsockopt(s, SOL_SOCKET, SO_RCVBUF, &rbl, sizeof(rbl));
    while (1) {
        int ret = read(s, buffer, size);
        if (ret <= 0) break;
    }
    return 0;
}
And the server code:
bash-4.1$ cat server.c
#include <sys/types.h>
#include <sys/mman.h>
#include <memory.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <errno.h>
#include <stdio.h>
#include <sys/socket.h>
extern int errno;
#define size (32 * 1024 * 1024)
int main() {
    int fdsocket;
    struct sockaddr_in sock;
    fdsocket = socket(AF_INET, SOCK_STREAM, 0);
    int rbl = 1048576;
    int g = setsockopt(fdsocket, SOL_SOCKET, SO_SNDBUF, &rbl, sizeof(rbl));
    sock.sin_family = AF_INET;
    sock.sin_addr.s_addr = inet_addr("<SERVERIP>");
    sock.sin_port = htons(25000);
    memset(sock.sin_zero, '\0', sizeof sock.sin_zero);
    g = bind(fdsocket, (struct sockaddr *) &sock, sizeof(sock));
    if (g == -1) {
        fprintf(stderr, "bind error: %d\n", errno);
        exit(1);
    }
    int p = listen(fdsocket, 1);
    char *buffer = (char *) mmap(NULL, size, PROT_WRITE | PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buffer == MAP_FAILED) { /* mmap failure is MAP_FAILED, not -1 */
        fprintf(stderr, "%d\n", errno);
        exit(-1);
    }
    memset(buffer, 0xc, size);
    int connfd = accept(fdsocket, (struct sockaddr*)NULL, NULL);
    rbl = 1048576;
    g = setsockopt(connfd, SOL_SOCKET, SO_SNDBUF, &rbl, sizeof(rbl));
    int wr = write(connfd, buffer, size);
    close(connfd);
}
There are many tunables, but whether they have an effect, and whether that effect is positive or negative, also depends on the situation. What are the defaults for the tunables? The values you set might actually be lower than the defaults on your OS, thereby decreasing performance. But larger buffers can also be detrimental, because more RAM is used and it might no longer fit into cache memory. It also depends on your network itself: is it wired or wireless, how many hops are there, what kind of routers are in between? Still, sending data in as large chunks as possible is usually the right thing to do.
One tunable you have missed is the congestion control algorithm, which you can tune with net.ipv4.tcp_congestion_control. Which algorithms are available depends on your kernel, and which one is best depends on your network and the kind of traffic you are sending.
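If you want to experiment per connection rather than system-wide, Linux also exposes the algorithm through a socket option (a sketch; TCP_CONGESTION is Linux-specific and the named algorithm must be compiled into or loaded in your kernel):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Select a congestion control algorithm for one socket. "cubic" is
 * the usual Linux default; see
 * /proc/sys/net/ipv4/tcp_available_congestion_control for options. */
static int set_cc(int sock, const char *algo)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CONGESTION,
                      algo, strlen(algo));
}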
Another thing is that TCP has two endpoints, and tunables on both sides are important.
Changes made with sysctl take effect immediately for new TCP connections.
The TCP parameters only have an effect on the endpoints of a TCP connection, so you don't have to change them on the VM host. But running in a guest means the packets it sends still need to be processed by the host in some way (if only to forward them to the real physical network interface), so your test will always be slower inside a virtual machine than on a physical machine.
What I'm missing is any benchmark numbers that you can compare with the actual network speed. Is there room for improvement at all? Maybe you are already at the maximum speed that is possible? In that case no amount of tuning will help. Note that the defaults are normally very reasonable.
Why is the following code slow? And by slow I mean 100x-1000x slow. It just repeatedly performs read/write directly on a TCP socket. The curious part is that it remains slow only if I use two function calls for both read AND write as shown below. If I change either the server or the client code to use a single function call (as in the comments), it becomes super fast.
Code snippet:
int main(...) {
    int sock = ...; // open TCP socket
    int i;
    char buf[100000];
    for(i=0;i<2000;++i)
    {   if(amServer)
        {   write(sock,buf,10);
            // read(sock,buf,20);
            read(sock,buf,10);
            read(sock,buf,10);
        }else
        {   read(sock,buf,10);
            // write(sock,buf,20);
            write(sock,buf,10);
            write(sock,buf,10);
        }
    }
    close(sock);
}
We stumbled on this in a larger program that was actually using stdio buffering. It mysteriously became sluggish the moment the payload size exceeded the buffer size by a small margin. Then I did some digging around with strace, and finally boiled the problem down to this. I can solve it by fooling around with the buffering strategy, but I'd really like to know what on earth is going on here. On my machine, it goes from 0.030 s to over a minute (tested both locally and across remote machines) when I change the two read calls to a single call.
These tests were done on various Linux distros, and various kernel versions. Same result.
Fully runnable code with networking boilerplate:
#include <netdb.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/ip.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
static int getsockaddr(const char* name,const char* port, struct sockaddr* res)
{
    struct addrinfo* list;
    if(getaddrinfo(name,port,NULL,&list) < 0) return -1;
    for(;list!=NULL && list->ai_family!=AF_INET;list=list->ai_next);
    if(!list) return -1;
    memcpy(res,list->ai_addr,list->ai_addrlen);
    freeaddrinfo(list);
    return 0;
}
// used as sock=tcpConnect(...); ...; close(sock);
static int tcpConnect(struct sockaddr_in* sa)
{
    int outsock;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    if(connect(outsock,(struct sockaddr*)sa,sizeof(*sa))<0) return -1;
    return outsock;
}
int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;
    return sock;
}
int tcpListenAny(const char* portn)
{
    in_port_t port;
    int outsock;
    if(sscanf(portn,"%hu",&port)<1) return -1;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    int reuse = 1;
    if(setsockopt(outsock,SOL_SOCKET,SO_REUSEADDR,
                  (const char*)&reuse,sizeof(reuse))<0)
        return fprintf(stderr,"setsockopt() failed\n"),-1;
    struct sockaddr_in sa = { .sin_family=AF_INET, .sin_port=htons(port)
                            , .sin_addr={INADDR_ANY} };
    if(bind(outsock,(struct sockaddr*)&sa,sizeof(sa))<0) return fprintf(stderr,"Bind failed\n"),-1;
    if(listen(outsock,SOMAXCONN)<0) return fprintf(stderr,"Listen failed\n"),-1;
    return outsock;
}
int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);
    return sock;
}
void writeLoop(int fd,const char* buf,size_t n)
{
    // Don't even bother incrementing buffer pointer
    while(n) n-=write(fd,buf,n);
}
void readLoop(int fd,char* buf,size_t n)
{
    while(n) n-=read(fd,buf,n);
}
int main(int argc,char* argv[])
{
    if(argc<3)
    {   fprintf(stderr,"Usage: round {server_addr|--} port\n");
        return -1;
    }
    bool amServer = (strcmp("--",argv[1])==0);
    int sock;
    if(amServer) sock=tcpAccept(argv[2]);
    else sock=tcpConnectTo(argv[1],argv[2]);
    if(sock<0) { fprintf(stderr,"Connection failed\n"); return -1; }
    int i;
    char buf[100000] = { 0 };
    for(i=0;i<4000;++i)
    {
        if(amServer)
        {   writeLoop(sock,buf,10);
            readLoop(sock,buf,20);
            //readLoop(sock,buf,10);
            //readLoop(sock,buf,10);
        }else
        {   readLoop(sock,buf,10);
            writeLoop(sock,buf,20);
            //writeLoop(sock,buf,10);
            //writeLoop(sock,buf,10);
        }
    }
    close(sock);
    return 0;
}
EDIT: This version is slightly different from the other snippet in that it reads/writes in a loop. So in this version, two separate writes automatically causes two separate read() calls, even if readLoop is called only once. But otherwise the problem still remains.
Interesting. You are being a victim of the Nagle's algorithm together with TCP delayed acknowledgements.
Nagle's algorithm is a mechanism used in TCP to defer transmission of small segments until enough data has accumulated to make it worth building and sending a segment over the network. From the Wikipedia article:
Nagle's algorithm works by combining a number of small outgoing
messages, and sending them all at once. Specifically, as long as there
is a sent packet for which the sender has received no acknowledgment,
the sender should keep buffering its output until it has a full
packet's worth of output, so that output can be sent all at once.
However, TCP typically employs something known as TCP delayed acknowledgements, a technique that consists of batching ACK replies together (because TCP uses cumulative ACKs) to reduce network traffic.
That wikipedia article further mentions this:
With both algorithms enabled, applications that do two successive
writes to a TCP connection, followed by a read that will not be
fulfilled until after the data from the second write has reached the
destination, experience a constant delay of up to 500 milliseconds,
the "ACK delay".
(Emphasis mine)
In your specific case, since the server doesn't send more data before reading the reply, the client is causing the delay: if the client writes twice, the second write will be delayed.
If Nagle's algorithm is being used by the sending party, data will be
queued by the sender until an ACK is received. If the sender does not
send enough data to fill the maximum segment size (for example, if it
performs two small writes followed by a blocking read) then the
transfer will pause up to the ACK delay timeout.
So, when the client makes 2 write calls, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
Client issues the second write. The previous write has not been acknowledged, so Nagle's algorithm defers transmission until more data arrives (until enough data has been collected to make a segment) or the previous write is ACKed.
Server is tired of waiting and after 500 ms acknowledges the segment.
Client finally completes the 2nd write.
With 1 write, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
The server writes to the socket. An ACK is part of the TCP header, so if you're writing, you might as well acknowledge the previous segment at no extra cost. Do it.
Meanwhile, the client wrote once, so it was already waiting on the next read - there was no 2nd write waiting for the server's ACK.
If you want to keep writing twice on the client side, you need to disable Nagle's algorithm. This is the solution proposed by the algorithm's author himself:
The user-level solution is to avoid write-write-read sequences on
sockets. write-read-write-read is fine. write-write-write is fine. But
write-write-read is a killer. So, if you can, buffer up your little
writes to TCP and send them all at once. Using the standard UNIX I/O
package and flushing write before each read usually works.
(See the citation on Wikipedia)
As mentioned by David Schwartz in the comments, this may not be the greatest idea for various reasons, but it illustrates the point and shows that this is indeed causing the delay.
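A sketch of that user-level fix for the client above: hand both small writes to TCP as one buffer with writev(2), so Nagle has nothing to hold back (a hypothetical variant of the client branch, not the author's code):
#include <stdio.h>
#include <sys/uio.h> /* for writev */

/* Instead of two 10-byte write()s (write-write-read), submit both
 * pieces as a single 20-byte buffer with one syscall. */
struct iovec iov[2] = {
    { .iov_base = buf,      .iov_len = 10 },
    { .iov_base = buf + 10, .iov_len = 10 },
};
if (writev(sock, iov, 2) < 0) /* one segment on the wire */
    perror("writev");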
To disable it, you need to set the TCP_NODELAY option on the sockets with setsockopt(2).
This can be done in tcpConnectTo() for the client:
int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;
    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");
    return sock;
}
And in tcpAccept() for the server:
int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);
    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");
    return sock;
}
It's interesting to see the huge difference this makes.
If you'd rather not mess with the socket options, it's enough to ensure that the client writes once - and only once - before the next read. You can still have the server read twice:
for(i=0;i<4000;++i)
{
    if(amServer)
    {   writeLoop(sock,buf,10);
        //readLoop(sock,buf,20);
        readLoop(sock,buf,10);
        readLoop(sock,buf,10);
    }else
    {   readLoop(sock,buf,10);
        writeLoop(sock,buf,20);
        //writeLoop(sock,buf,10);
        //writeLoop(sock,buf,10);
    }
}
According to Wikipedia, a traceroute program
Traceroute, by default, sends a sequence of User Datagram Protocol
(UDP) packets addressed to a destination host[...] The time-to-live
(TTL) value, also known as hop limit, is used in determining the
intermediate routers being traversed towards the destination. Routers
decrement packets' TTL value by 1 when routing and discard packets
whose TTL value has reached zero, returning the ICMP error message
ICMP Time Exceeded.[..]
I started writing a program (using an example UDP program as a guide) to adhere to this specification,
#include <sys/socket.h>
#include <assert.h>
#include <stdlib.h> // for malloc/free
#include <netinet/udp.h> //Provides declarations for udp header
#include <netinet/ip.h> //Provides declarations for ip header
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <unistd.h>
#define DATAGRAM_LEN (sizeof(struct iphdr) + sizeof(struct udphdr))
unsigned short csum(unsigned short *ptr, int nbytes) {
    register long sum;
    unsigned short oddbyte;
    register short answer;
    sum = 0;
    while (nbytes > 1) {
        sum += *ptr++;
        nbytes -= 2;
    }
    if (nbytes == 1) {
        oddbyte = 0;
        *((u_char *)&oddbyte) = *(u_char *)ptr;
        sum += oddbyte;
    }
    sum = (sum >> 16) + (sum & 0xffff);
    sum = sum + (sum >> 16);
    answer = (short)~sum;
    return (answer);
}
char *new_packet(int ttl, struct sockaddr_in sin) {
    static int id = 0;
    char *datagram = malloc(DATAGRAM_LEN);
    struct iphdr *iph = (struct iphdr *)datagram;
    struct udphdr *udph = (struct udphdr *)(datagram + sizeof(struct iphdr));
    iph->ihl = 5;
    iph->version = 4;
    iph->tos = 0;
    iph->tot_len = DATAGRAM_LEN;
    iph->id = htons(++id); // id of this packet (16-bit field, so htons)
    iph->frag_off = 0;
    iph->ttl = ttl;
    iph->protocol = IPPROTO_UDP;
    iph->saddr = inet_addr("127.0.0.1"); // spoof the source ip address
    iph->daddr = sin.sin_addr.s_addr;
    iph->check = csum((unsigned short *)datagram, iph->tot_len);
    udph->source = htons(6666);
    udph->dest = htons(8622);
    udph->len = htons(8); // udp header size
    udph->check = csum((unsigned short *)datagram, DATAGRAM_LEN);
    return datagram;
}
int main(int argc, char **argv) {
    int s, ttl, repeat;
    struct sockaddr_in sin;
    char *data;
    printf("\n");
    if (argc != 3) {
        printf("usage: %s <host> <port>", argv[0]);
        return __LINE__;
    }
    sin.sin_family = AF_INET;
    sin.sin_addr.s_addr = inet_addr(argv[1]);
    sin.sin_port = htons(atoi(argv[2]));
    if ((s = socket(AF_PACKET, SOCK_RAW, 0)) < 0) {
        printf("Failed to create socket.\n");
        return __LINE__;
    }
    ttl = 1, repeat = 0;
    while (ttl < 2) {
        data = new_packet(ttl, sin); // pass both arguments declared above
        if (write(s, data, DATAGRAM_LEN) != DATAGRAM_LEN) {
            printf("Socket failed to send packet.\n");
            return __LINE__;
        }
        read(s, data, DATAGRAM_LEN);
        free(data);
        if (++repeat > 2) {
            repeat = 0;
            ttl++;
        }
    }
    return 0;
}
... however at this point I have a few questions.
Is read(s, data, ...) reading whole packets at a time, or do I need to parse the data read from the socket, seeking markers particular to IP packets?
What is the best way to uniquely mark my packets as they return to my box as expired?
Should I set up a second socket with the IPPROTO_ICMP flag, or is it easier to write a filter and accept everything?
Do any other common mistakes exist, or are any common obstacles foreseeable?
Here are some of my suggestions (based on assumption it's a Linux machine).
read packets
You might want to read whole 1500-byte packets (an entire Ethernet frame). Don't worry: smaller frames will still be read completely, with read returning the length of the data read.
The best way to add a marker is to have some UDP payload (a simple unsigned int should be good enough). Increase it on every packet sent. (I just did a tcpdump on traceroute: the ICMP error does return an entire IP frame back, so you can look at the returned IP frame, parse the UDP payload and so on. Note your DATAGRAM_LEN would change accordingly.) Of course you can use the IP ID, but be careful: the ID is mainly used for fragmentation. You should be okay with that, because you'd not be approaching the fragmentation limit on any intermediate routers with these packet sizes. Generally, though, it's not a good idea to 'steal' protocol fields that are meant for something else for our custom purposes.
A cleaner way could be to actually use IPPROTO_ICMP on raw sockets (if manuals are installed on your machine, see man 7 raw and man 7 icmp). That way you do not receive a copy of every packet on your device and then have to ignore those that are not ICMP.
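For instance, a rough sketch of the receive side with a raw ICMP socket (assumes the same Linux headers as your program; needs root or CAP_NET_RAW):
#include <arpa/inet.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <stdio.h>
#include <sys/socket.h>

/* Receive ICMP errors only, instead of every frame on the device.
 * A Time Exceeded reply quotes the original IP header plus the first
 * 8 bytes of the UDP header, enough to match it to your probe. */
int icmp_sock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
if (icmp_sock < 0)
    perror("socket");

char reply[1500]; /* one full Ethernet payload */
ssize_t n = recv(icmp_sock, reply, sizeof reply, 0);
if (n > 0) {
    struct iphdr *ip = (struct iphdr *)reply;
    struct icmphdr *icmp = (struct icmphdr *)(reply + ip->ihl * 4);
    if (icmp->type == ICMP_TIME_EXCEEDED)
        printf("hop reply from %s\n",
               inet_ntoa(*(struct in_addr *)&ip->saddr));
}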
If you are using type SOCK_RAW on AF_PACKET, you will have to attach a link-layer header manually, or you can use SOCK_DGRAM and check. Also see man 7 packet for a lot of subtleties.
Hope that helps. Or are you looking for some actual code?
A common pitfall is that programming at this level needs very careful use of the proper include files. For instance, your program as-is won't compile on NetBSD, which is typically quite strict in following relevant standards.
Even when I add some includes, there is no struct iphdr but there is a struct udpiphdr instead.
So for now the rest of my answer is not based on trying your program in practice.
read(2) can be used to read single packets at a time. For packet-oriented protocols, such as UDP, you'll never get more data from it than a single packet.
However you can also use recvfrom(2), recv(2) or recvmsg(2) to receive the packets.
If fildes refers to a socket, read() shall be equivalent to recv()
with no flags set.
To identify the packets, I believe using the id field is typically done, as you already do. I am not sure what you mean by "mark my packets as they return to my box as expired", since your packets don't return to you. What you may get back are ICMP Time Exceeded messages. These usually arrive within a few seconds, if they arrive at all. Sometimes they are not sent; sometimes they may be blocked by misconfigured routers between you and their sender.
Note that this assumes the IP ID you set in your packet is respected by the network stack you're using. It is possible that it isn't, and that your chosen ID is replaced with a different one. Van Jacobson, the original author of the traceroute command as found in NetBSD, therefore used a different method:
* The udp port usage may appear bizarre (well, ok, it is bizarre).
* The problem is that an icmp message only contains 8 bytes of
* data from the original datagram. 8 bytes is the size of a udp
* header so, if we want to associate replies with the original
* datagram, the necessary information must be encoded into the
* udp header (the ip id could be used but there's no way to
* interlock with the kernel's assignment of ip id's and, anyway,
* it would have taken a lot more kernel hacking to allow this
* code to set the ip id). So, to allow two or more users to
* use traceroute simultaneously, we use this task's pid as the
* source port (the high bit is set to move the port number out
* of the "likely" range). To keep track of which probe is being
* replied to (so times and/or hop counts don't get confused by a
* reply that was delayed in transit), we increment the destination
* port number before each probe.
Using an IPPROTO_ICMP socket for receiving the replies is more likely to be efficient than trying to receive all packets. It would also require fewer privileges. Of course, sending raw packets normally requires root already, but it could make a difference if a more fine-grained permission system is in use.
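Following that scheme, the port arithmetic might look like this (a sketch against the question's new_packet(); 33434 is the base destination port traceroute conventionally uses):
#include <stdint.h>
#include <unistd.h> /* getpid */

/* Encode probe identity in the UDP ports, as classic traceroute does:
 * the pid (high bit forced on) as the source port, and a destination
 * port incremented before each probe. */
uint16_t src_port  = (uint16_t)((getpid() & 0x7fff) | 0x8000);
uint16_t base_port = 33434;
int probe = 0;
/* per probe, inside new_packet(): */
udph->source = htons(src_port);
udph->dest   = htons((uint16_t)(base_port + probe++));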
I have an application that reads large files from a server and hangs frequently on a particular machine. It has worked successfully under RHEL5.2 for a long time. We have recently upgraded to RHEL6.1 and it now hangs regularly.
I have created a test app that reproduces the problem. It hangs approx 98 times out of 100.
#include <errno.h>
#include <stdint.h> // for uint32_t
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/param.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <sys/time.h>
int mFD = 0;
void open_socket()
{
    struct addrinfo hints, *res;
    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_family = AF_INET;
    if (getaddrinfo("localhost", "60000", &hints, &res) != 0)
    {
        fprintf(stderr, "Exit %d\n", __LINE__);
        exit(1);
    }
    mFD = socket(res->ai_family, res->ai_socktype, res->ai_protocol);
    if (mFD == -1)
    {
        fprintf(stderr, "Exit %d\n", __LINE__);
        exit(1);
    }
    if (connect(mFD, res->ai_addr, res->ai_addrlen) < 0)
    {
        fprintf(stderr, "Exit %d\n", __LINE__);
        exit(1);
    }
    freeaddrinfo(res);
}
void read_message(int size, char* data)
{
    int bytesLeft = size;
    int numRd = 0;
    while (bytesLeft != 0)
    {
        fprintf(stderr, "reading %d bytes\n", bytesLeft);
        /* Replacing MSG_WAITALL with 0 works fine */
        int num = recv(mFD, data, bytesLeft, MSG_WAITALL);
        if (num == 0)
        {
            break;
        }
        else if (num < 0 && errno != EINTR)
        {
            fprintf(stderr, "Exit %d\n", __LINE__);
            exit(1);
        }
        else if (num > 0)
        {
            numRd += num;
            data += num; /* char*: well-defined pointer arithmetic */
            bytesLeft -= num;
            fprintf(stderr, "read %d bytes - remaining = %d\n", num, bytesLeft);
        }
    }
    fprintf(stderr, "read total of %d bytes\n", numRd);
}
int main(int argc, char **argv)
{
    open_socket();
    uint32_t raw_len = atoi(argv[1]);
    char raw[raw_len];
    read_message(raw_len, raw);
    return 0;
}
Some notes from my testing:
If "localhost" maps to the loopback address 127.0.0.1, the app hangs on the call to recv() and NEVER returns.
If "localhost" maps to the ip of the machine, thus routing the packets via the ethernet interface, the app completes successfully.
When I experience a hang, the server sends a "TCP Window Full" message, and the client responds with a "TCP ZeroWindow" message (see image and attached tcpdump capture). From this point, it hangs forever with the server sending keep-alives and the client sending ZeroWindow messages. The client never seems to expand its window, allowing the transfer to complete.
During the hang, if I examine the output of "netstat -a", there is data in the servers send queue but the clients receive queue is empty.
If I remove the MSG_WAITALL flag from the recv() call, the app completes successfully.
The hanging issue only arises using the loopback interface on 1 particular machine. I suspect this may all be related to timing dependencies.
As I drop the size of the 'file', the likelihood of the hang occurring is reduced
The source for the test app can be found here:
Socket test source
The tcpdump capture from the loopback interface can be found here:
tcpdump capture
I reproduce the issue by issuing the following commands:
> gcc socket_test.c -o socket_test
> perl -e 'for (1..6000000){ print "a" }' | nc -l 60000
> ./socket_test 6000000
This sees 6000000 bytes sent to the test app which tries to read the data using a single call to recv().
I would love to hear any suggestions on what I might be doing wrong or any further ways to debug the issue.
MSG_WAITALL should block until all data has been received. From the manual page on recv:
This flag requests that the operation block until the full request is satisfied.
However, the buffers in the network stack are probably not large enough to contain everything, which is the reason for the "TCP Window Full" messages from the server. The client's network stack simply can't hold that much data.
The solution is either to increase the buffer sizes (the SO_RCVBUF option to setsockopt), split the message into smaller pieces, or receive smaller chunks and put them into your own buffer. The last is what I would recommend.
Edit: I see in your code that you already do what I suggested (read smaller chunks with own buffering,) so just remove the MSG_WAITALL flag and it should work.
Oh, and when recv returns zero, that means the other end has closed the connection, and you should do the same.
Consider these two possible rules:
The receiver may wait for the sender to send more before receiving what has already been sent.
The sender may wait for the receiver to receive what has already been sent before sending more.
We can have either of these rules, but we cannot have both of these rules.
Why? Because if the receiver is permitted to wait for the sender, that means the sender cannot wait for the receiver to receive before sending more, otherwise we deadlock. And if the sender is permitted to wait for the receiver, that means the receiver cannot wait for the sender to send before receiving more, otherwise we deadlock.
If both of these things happen at the same time, we deadlock. The sender will not send more until the receiver receives what has already been sent, and the receiver will not receive what has already been sent unless the sender sends more. Boom.
TCP chooses rule 2 (for reasons that should be obvious). Thus it cannot support rule 1. But in your code, you are the receiver, and you are waiting for the sender to send more before you receive what has already been sent. So this will deadlock.
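Concretely, for the code in the question: receive whatever has already arrived (flags 0 instead of MSG_WAITALL) so the receive buffer keeps draining and the sender's window can reopen. A sketch using the question's names:
/* Drain-as-you-go receive loop: never wait for the full transfer
 * in one recv() call, so the window keeps opening. */
size_t done = 0;
while (done < raw_len) {
    ssize_t n = recv(mFD, raw + done, raw_len - done, 0);
    if (n == 0)
        break; /* peer closed the connection */
    if (n < 0) {
        if (errno == EINTR)
            continue; /* interrupted; retry */
        perror("recv");
        break;
    }
    done += (size_t)n;
}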