send()'s man page reveals the MSG_MORE flag which is asserted to act like TCP_CORK. I have a wrapper function around send():
int SocketConnection_Write(SocketConnection *this, void *buf, int len) {
errno = 0;
int sent = send(this->fd, buf, len, MSG_NOSIGNAL);
if (errno == EPIPE || errno == ENOTCONN) {
throw(exc, &SocketConnection_NotConnectedException);
} else if (errno == ECONNRESET) {
throw(exc, &SocketConnection_ConnectionResetException);
} else if (sent != len) {
throw(exc, &SocketConnection_LengthMismatchException);
}
return sent;
}
Assuming I want to use the kernel buffer, I could go with TCP_CORK: enable it whenever necessary and then disable it to flush the buffer. However, that requires additional system calls. Thus, MSG_MORE seems more appropriate to me. I'd simply change the above send() line to:
int sent = send(this->fd, buf, len, MSG_NOSIGNAL | MSG_MORE);
According to lwn.net, packets will be flushed automatically if they are large enough:
If an application sets that option on
a socket, the kernel will not send out
short packets. Instead, it will wait
until enough data has shown up to fill
a maximum-size packet, then send it.
When TCP_CORK is turned off, any
remaining data will go out on the
wire.
But this section only refers to TCP_CORK. Now, what is the proper way to flush MSG_MORE packets?
I can only think of two possibilities:
Call send() with an empty buffer and without MSG_MORE being set
Re-apply the TCP_CORK option as described on this page
Unfortunately the whole topic is very poorly documented and I couldn't find much on the Internet.
I am also wondering how to check that everything works as expected? Obviously running the server through strace is not an option. So the simplest way would be to use netcat and then look at its strace output? Or will the kernel handle traffic transmitted over a loopback interface differently?
I have taken a look at the kernel source and both assumptions seem to be true. The following code is extracted from net/ipv4/tcp.c (2.6.33.1).
static inline void tcp_push(struct sock *sk, int flags, int mss_now,
int nonagle)
{
struct tcp_sock *tp = tcp_sk(sk);
if (tcp_send_head(sk)) {
struct sk_buff *skb = tcp_write_queue_tail(sk);
if (!(flags & MSG_MORE) || forced_push(tp))
tcp_mark_push(tp, skb);
tcp_mark_urg(tp, flags, skb);
__tcp_push_pending_frames(sk, mss_now,
(flags & MSG_MORE) ? TCP_NAGLE_CORK : nonagle);
}
}
Hence, if the flag is not set, the pending frames will definitely be flushed. But this is only the case when the buffer is not empty:
static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
size_t psize, int flags)
{
(...)
ssize_t copied;
(...)
copied = 0;
while (psize > 0) {
(...)
if (forced_push(tp)) {
tcp_mark_push(tp, skb);
__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
} else if (skb == tcp_send_head(sk))
tcp_push_one(sk, mss_now);
continue;
wait_for_sndbuf:
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
if (copied)
tcp_push(sk, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);
if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
goto do_error;
mss_now = tcp_send_mss(sk, &size_goal, flags);
}
out:
if (copied)
tcp_push(sk, flags, mss_now, tp->nonagle);
return copied;
do_error:
if (copied)
goto out;
out_err:
return sk_stream_error(sk, flags, err);
}
The while loop's body will never be executed because psize is not greater than 0. Then, in the out section, there is another chance to call tcp_push(), but because copied still holds its initial value of 0, that call is skipped as well.
So sending a packet with the length 0 will never result in a flush.
The next theory was to re-apply TCP_CORK. Let's take a look at the code first:
static int do_tcp_setsockopt(struct sock *sk, int level,
int optname, char __user *optval, unsigned int optlen)
{
(...)
switch (optname) {
(...)
case TCP_NODELAY:
if (val) {
/* TCP_NODELAY is weaker than TCP_CORK, so that
* this option on corked socket is remembered, but
* it is not activated until cork is cleared.
*
* However, when TCP_NODELAY is set we make
* an explicit push, which overrides even TCP_CORK
* for currently queued segments.
*/
tp->nonagle |= TCP_NAGLE_OFF|TCP_NAGLE_PUSH;
tcp_push_pending_frames(sk);
} else {
tp->nonagle &= ~TCP_NAGLE_OFF;
}
break;
case TCP_CORK:
/* When set indicates to always queue non-full frames.
* Later the user clears this option and we transmit
* any pending partial frames in the queue. This is
* meant to be used alongside sendfile() to get properly
* filled frames when the user (for example) must write
* out headers with a write() call first and then use
* sendfile to send out the data parts.
*
* TCP_CORK can be set together with TCP_NODELAY and it is
* stronger than TCP_NODELAY.
*/
if (val) {
tp->nonagle |= TCP_NAGLE_CORK;
} else {
tp->nonagle &= ~TCP_NAGLE_CORK;
if (tp->nonagle&TCP_NAGLE_OFF)
tp->nonagle |= TCP_NAGLE_PUSH;
tcp_push_pending_frames(sk);
}
break;
(...)
As you can see, there are two ways to flush: set TCP_NODELAY to 1 or set TCP_CORK to 0. Luckily, neither path checks whether the flag is already set. Thus, my initial plan to re-apply the TCP_CORK flag can be simplified to just disabling it, even if it is not currently set.
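A minimal sketch of that flush, assuming Linux and a connected TCP socket. The helper names socket_flush_more() and send_two_parts() are made up for illustration and are not part of any library; only send(), setsockopt(), MSG_MORE, MSG_NOSIGNAL and TCP_CORK are real.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: push out data queued with MSG_MORE by clearing
 * TCP_CORK. According to the kernel code quoted above, clearing the
 * option triggers a push even if it was never set. Linux-specific.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
static int socket_flush_more(int fd)
{
    int off = 0;
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}

/* Hypothetical usage: queue two parts with MSG_MORE, then flush once. */
static int send_two_parts(int fd, const void *hdr, size_t hlen,
                          const void *body, size_t blen)
{
    if (send(fd, hdr, hlen, MSG_NOSIGNAL | MSG_MORE) != (ssize_t)hlen)
        return -1;
    if (send(fd, body, blen, MSG_NOSIGNAL | MSG_MORE) != (ssize_t)blen)
        return -1;
    return socket_flush_more(fd);
}
```

This costs one extra system call per flush, so it only pays off when several send() calls precede it. If the last chunk is known in advance, sending it without MSG_MORE avoids the setsockopt() entirely.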
I hope this helps someone with similar issues.
That's a lot of research... all I can offer is this empirical note:
When I send a bunch of packets with MSG_MORE set, followed by a packet without MSG_MORE, the whole lot goes out. It works a treat for something like this:
for (i=0; i<mg_live.length; i++) {
// [...]
if ((n = pth_send(sock, query, len, MSG_MORE | MSG_NOSIGNAL)) < len) {
printf("error writing to socket (sent %i bytes of %i)\n", n, len);
exit(1);
}
}
}
pth_send(sock, "END\n", 4, MSG_NOSIGNAL);
That is, when you're sending out all the packets at once, and have a clearly defined end... AND you are only using one socket.
If you tried writing to another socket in the middle of the above loop, you may find that Linux releases the previously held packets. At least that appears to be the trouble I'm having right now. But it might be an easy solution for you.
I am writing a little web server that uses epoll and multithreading. For small and short HTTP/1.1 requests and responses, it works as expected. But large file downloads are always interrupted by the timer I devised. I expire the timers with a fixed timeout value, but I also have an if statement to check whether the response was sent successfully.
static void
_expire_timers(list_t *timers, long timeout)
{
httpconn_t *conn;
int sockfd;
node_t *timer;
long cur_time;
long stamp;
timer = list_first(timers);
if (timer) {
cur_time = mstime();
do {
stamp = list_node_stamp(timer);
conn = (httpconn_t *)list_node_data(timer);
if ((cur_time - stamp >= timeout) && httpconn_close(conn)) {
sockfd = httpconn_sockfd(conn);
DEBSI("[CONN] socket closed, server disconnected", sockfd);
close(sockfd);
list_del(timers, stamp);
}
timer = list_next(timers);
} while (timer);
}
}
I realized that in a non-blocking environment, the write() function might be interrupted during the request-response communication. I wonder how long write() can block or how much data write() can send, so I can tweak the timeout setting in my code.
This is the code which involves write(),
void
http_rep_get(int clifd, void *cache, char *path, void *req)
{
httpmsg_t *rep;
int len_msg;
char *bytes;
rep = _get_rep_msg((list_t *)cache, path, req);
bytes = msg_create_rep(rep, &len_msg);
/* send msg */
DEBSI("[REP] Sending reply msg...", clifd);
write(clifd, bytes, len_msg);
/* send body */
DEBSI("[REP] Sending body...", clifd);
write(clifd, msg_body_start(rep), msg_body_len(rep));
free(bytes);
msg_destroy(rep, 0);
}
And the following is the epoll loop I use to process the incoming requests,
do {
nevents = epoll_wait(epfd, events, MAXEVENTS, HTTP_KEEPALIVE_TIME);
if (nevents == -1) perror("epoll_wait()");
/* expire the timers */
_expire_timers(timers, HTTP_KEEPALIVE_TIME);
/* loop through events */
for (i = 0; i < nevents; i++) {
conn = (httpconn_t *)events[i].data.ptr;
sockfd = httpconn_sockfd(conn);
/* error case */
if ((events[i].events & EPOLLERR) || (events[i].events & EPOLLHUP) ||
(!(events[i].events & EPOLLIN))) {
perror("EPOLL ERR|HUP");
list_update(timers, conn, mstime());
break;
}
else if (sockfd == srvfd) {
_receive_conn(srvfd, epfd, cache, timers);
}
else {
/* client socket; read client data and process it */
thpool_add_task(taskpool, httpconn_task, conn);
}
}
} while (svc_running);
http_rep_get() is executed by the thread pool handler httpconn_task(), and HTTP_KEEPALIVE_TIME is the fixed timeout. The handler httpconn_task() adds a timer to the timers list once a request arrives. Since write() is executed in http_rep_get(), I think it might be interrupted by the timers. I guess I can change the way I write to the clients, but I need to know how much write() can do.
If you are interested, you may browse my project to help me with this.
https://github.com/grassroot72/Maestro
Cheers,
Edward
Is there a size limit of write() for a socket fd?
It depends on what you mean by a limit.
As the comments explain, a write call may write fewer bytes than you ask it to. Furthermore, this is expected behavior if you perform a large write to a socket. However, there is no reliable way to determine (or predict) how many bytes will be written before you call write.
The correct way to deal with this is to check how many bytes were actually written each time, and use a loop to ensure that all bytes are written (or until you get a failure).
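Such a loop could look roughly like the sketch below. The name write_all() is made up for illustration; the only real calls are write() and poll(). It assumes the calling thread may block until the socket drains, which fits a thread pool worker. A purely event-driven design would instead save the unsent remainder and wait for EPOLLOUT.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly len bytes to fd. Handles short
 * writes and EINTR, and for non-blocking descriptors waits in poll()
 * when the kernel buffer is full (EAGAIN / EWOULDBLOCK).
 * Returns 0 on success, -1 on error with errno set. */
static int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n >= 0) {
            /* Short write (or 0 bytes): advance and try again. */
            p += n;
            len -= (size_t)n;
        } else if (errno == EINTR) {
            continue;                       /* interrupted, retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            if (poll(&pfd, 1, -1) < 0 && errno != EINTR)
                return -1;                  /* poll itself failed */
        } else {
            return -1;                      /* real write error */
        }
    }
    return 0;
}
```

With a helper like this, the keep-alive timer should be refreshed only after the call returns, so a long download is not counted as idle time.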
I am developing a Linux module which I want to use to run my C program from kernel mode.
My problem here, in function read() of the module, I need to use a function named eval_keycode(), which is defined in my user space program.
When I try to compile my module, this error occurs :
error: implicit declaration of function ‘eval_keycode’
which is confirming my problem described above.
This is the read() function of my module :
ssize_t exer_read(struct file *pfile, char __user *buffer, size_t length, loff_t *offset) {
struct file *f = pfile->private_data;
enum { MAX_BUF_SIZE = 4096 };
size_t buf_size = 0;
char *buf = NULL;
ssize_t total = 0;
ssize_t rc = 0;
struct input_event *ev;
int yalv;
/* Allocate temporary buffer. */
if (length) {
buf_size = min_t(size_t, MAX_BUF_SIZE, length);
ev = kmalloc(buf_size, GFP_KERNEL);
if (ev == NULL) {
return -ENOMEM;
}
}
/* Read file to buffer in chunks. */
do {
size_t amount = min_t(size_t, length, buf_size);
rc = kernel_read(f, ev, amount, offset);
if (rc > 0) {
/* Have read some data from file. */
if (copy_to_user(buffer, ev, rc) != 0) {
/* Bad user memory! */
rc = -EFAULT;
} else {
/* Update totals. */
total += rc;
buffer += rc;
*offset += rc;
length -= rc;
for (yalv = 0; yalv < (int) (rc / sizeof(struct input_event)); yalv++) {
if (ev[yalv].type == EV_KEY) {
if (ev[yalv].value == 0)
eval_keycode(ev[yalv].code);
}
}
if (rc < amount) {
/* Didn't read the full amount, so terminate early. */
rc = 0;
}
}
}
}
while (rc > 0 && length > 0);
/* Free temporary buffer. */
kfree(buf);
if (total > 0) {
return total;
}
return rc;
}
This is my user space eval_keycode() defined function :
void eval_keycode(int code)
{
static int red_state = 0;
static int green_state = 0;
switch (code) {
case 260:
printf("BTN left pressed\n");
/* figure out red state */
red_state = red_state ? 0 : 1;
change_led_state(LED_PATH "/" red "/brightness", red_state);
break;
case BTN_RIGHT:
printf("BTN right pressed\n");
/* figure out green state */
green_state = green_state ? 0 : 1;
change_led_state(LED_PATH "/" green "/brightness", green_state);
break;
}
}
How can I call the user space eval_keycode() function from the module in order to solve this problem?
Thank you.
You can, but it is a really bad idea. You need to establish a pointer to your user mode function, arrange for the process containing that function to be running (in the kernel) when you invoke it. That is a lot of work, and is fundamentally malware due to the security holes it creates. Additionally, in the mad dash to lock the door to the now empty barn in the wake of spectre et al, new layers of hackery are being deployed in newer CPUs to make this even harder.
A different approach:
In your original query, you are running this driver as a "tee"; that is, you take the input you receive from the device, give a copy to the caller, and call eval_keycode() with each input. eval_keycode() doesn't modify the data, and the kernel module discards it afterwards. So eval_keycode() doesn't really need to be a function called by the kernel; rather, there could be a user function:
void ProcessEvents(int fd) {
struct input_event ev;
while (read(fd, &ev, sizeof ev) == sizeof ev) {
eval_keycode(&ev);
}
}
if you could arrange for all the events to be fed into that fd. With this setup, your problem becomes more plumbing than kernel renovation. The user creates a pipe/socket/fifo/... and passes the write end to your kernel module (yay more ioctl()s). Your kernel module can then carefully use kernel_write() ( or vfs_write if you are stuck in the past ) to make these events available to the user handler. It wants to be careful about where its blocking points are.
You could extend this to work as a transform; that is where your driver transforms the events via a user mode handler; but at that point, you might really consider FUSE a better solution.
There is no traditional (in the way a library works) way to "call" a user space "function".
Your user space code should be running in its own process (or another user space process), in which you would implement communications (through shared memory, interprocess calls [IPC], device files, interrupts...) where you handle the exchange of data, and act on the data (e.g. calling your eval_keycode function).
You basically want an upcall. You can find some explanation about that here, but it doesn't seem like Linux has an official upcall API.
However, as others have already mentioned, this isn't very good design. Upcalls are useful to servers implemented in the kernel.
If your exer_read() is only called for your own code (on your files for which you're implementing the driver), then perhaps inotify would be a better design.
If your exer_read() can be called for any file (e.g. you want any file write on the machine to change the LED state), then you want your userspace process containing eval_keycode() to poll some character device, and you want your module to write the code to this character device instead of calling eval_keycode().
If, however, change_led_state() is synchronous, and you actually need the read to block until it returns, then you are advised to reconsider your design... but that's a valid use case for upcalls.
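For the character-device variant, the user space side might look like the sketch below. It assumes the module writes whole struct input_event records to some descriptor the process has opened. The names process_events() and keycode_handler are made up for illustration, and the handler stands in for eval_keycode().

```c
#include <linux/input.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical callback type; eval_keycode() from the question fits it. */
typedef void (*keycode_handler)(int code);

/* Wait for events on fd and call handler for every key release
 * (type EV_KEY, value 0), mirroring the check in exer_read().
 * Stops at end of file, on a short read, or on any error.
 * Returns the number of events passed to handler. */
static int process_events(int fd, keycode_handler handler)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    struct input_event ev;
    int handled = 0;

    while (poll(&pfd, 1, -1) > 0) {
        ssize_t n = read(fd, &ev, sizeof ev);
        if (n != (ssize_t)sizeof ev)
            break;                  /* EOF, error, or partial record */
        if (ev.type == EV_KEY && ev.value == 0) {
            handler(ev.code);
            handled++;
        }
    }
    return handled;
}
```

The kernel module then only has to make events readable on that descriptor; everything that touches LEDs or prints messages stays in user space.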
I am fairly new to libpcap.
I am using this library to capture packets, and the code I wrote to capture them is below.
The interface I am tapping is constantly flooded with ARP packets, so there are always packets arriving on it, but I am not able to capture them. The interface is up and running.
I get no error from the pcap_open_live() function.
The code is in C, and I am running it on a 32-bit FreeBSD 10 machine.
void captutre_packet(char* ifname , int snaplen) {
char ebuf[PCAP_ERRBUF_SIZE];
int pflag = 0;/*promiscuous mode*/
snaplen = 100;
pcap_t* pcap = pcap_open_live(ifname, snaplen, !pflag , 0, ebuf);
if(pcap!=NULL) {
printf("pcap_open_live for %s \n" ,ifname );
}
int fd = pcap_get_selectable_fd(pcap);
pcap_setnonblock(pcap, 1, ebuf);
fd_set fds;
struct timeval tv;
FD_ZERO(&fds);
FD_SET(fd, &fds);
tv.tv_sec = 3;
tv.tv_usec = 0;
int retval = select(fd + 1, &fds, NULL, NULL, &tv);
if (retval == -1)
perror("select()");
else if (retval) {
printf("Data is available now.\n");
printf("calling pcap_dispatch \n");
pcap_dispatch(pcap , -1 , (pcap_handler) callback , NULL);
}
else
printf("No data within 3 seconds.\n");
}
void
callback(const char *unused, struct pcap_pkthdr *h, uint8_t *packet)
{
printf("got some packet \n");
}
retval is always 0, which means the select() call timed out.
I don't know what is happening under the hood. I followed a tutorial that does exactly the same thing, and I do not know what I am missing.
I also want to understand how a packet, once received at the Ethernet layer, gets copied into the BPF socket/device opened with pcap_open_live(), and how the buffer is copied from kernel space to user space.
And for how long can we tap a packet before the kernel consumes or rejects it?
The pcap_open_live() call provided 0 as the packet buffer timeout value (the fourth argument). libpcap does not specify what a value of 0 means, because different packet capture mechanisms, on different operating systems, treat that value differently.
On systems using BPF, such as the BSDs and macOS, it means "wait until the packet buffer is completely full before providing the packets". If the packet buffer is large (it defaults to about 256K on FreeBSD) and the packets are small (60 bytes for ARP packets), it may take a significant amount of time for the buffer to fill, longer than the timeout you're handing to select().
It's probably best to have a timeout value of between 100 milliseconds and 1 second, so pass an argument of somewhere between 100 and 1000, not 0.
The script file has over 6000 bytes, which are copied into a buffer. The contents of the buffer are then written to the device connected to the serial port. However, the write function only returns 4608 bytes, whereas the buffer contains 6117 bytes. I'm unable to understand why this happens.
{
FILE *ptr;
long numbytes;
int i;
ptr=fopen("compass_script(1).4th","r");//Opening the script file
if(ptr==NULL)
return 1;
fseek(ptr,0,SEEK_END);
numbytes = ftell(ptr);//Number of bytes in the script
printf("number of bytes in the calibration script %ld\n",numbytes);
//Number of bytes in the script is 6117.
fseek(ptr,0,SEEK_SET);
char writebuffer[numbytes];//Creating a buffer to copy the file
if(writebuffer == NULL)
return 1;
int s=fread(writebuffer,sizeof(char),numbytes,ptr);
//Transferring contents into the buffer
perror("fread");
fclose(ptr);
fd = open("/dev/ttyUSB3",O_RDWR | O_NOCTTY | O_NONBLOCK);
//Opening serial port
speed_t baud=B115200;
struct termios serialset;//Setting a baud rate for communication
tcgetattr(fd,&serialset);
cfsetispeed(&serialset,baud);
cfsetospeed(&serialset,baud);
tcsetattr(fd,TCSANOW,&serialset);
long bytesw=0;
tcflush(fd,TCIFLUSH);
printf("\nnumbytes %ld",numbytes);
bytesw=write(fd,writebuffer,numbytes);
//Writing the script into the device connected to the serial port
printf("bytes written%ld\n",bytesw);//Only 4608 bytes are written
close (fd);
return 0;
}
Well, that's the specification. When you write to a regular file, your process is normally blocked until all the data is written, which means it runs again only once all the data has reached the disk buffers. This is not true for devices: the device driver is responsible for determining how much data is written in one pass. Depending on the driver, you may get all of the data written, only part of it, or none at all. That simply depends on the device and on how the driver implements its control.
Under the hood, device drivers normally have a limited amount of buffer memory and can accept only a limited amount of data at a time. There are two policies here: the driver can block the process until more buffer space becomes available, or it can return after a partial write. The O_NONBLOCK flag in your open() call asks for the latter.
It is your program's responsibility to accept a partial write and continue writing the rest of the buffer, or to pass the problem back to the calling module by reporting a partial write in turn. This approach is the most flexible one, and it is the one implemented everywhere. Now you have the reason for your partial write, but the ball is in your court: you have to decide what to do next.
Also, be careful with types: you store the fread() result in an int and the write() result in a long. Your amount of data is not huge, so nothing overflows here, but fread() returns size_t and write() returns ssize_t (ftell() does return long). Like the speed_t type you use for the baud rate, these types exist for a reason: long can be a 32-bit type where size_t is a 64-bit one.
The best thing you can do is to ensure the whole buffer is written by some code snippet like the next one:
char *p = buffer;
while (numbytes > 0) {
ssize_t n = write(fd, p, numbytes);
if (n < 0) {
perror("write");
/* driver signals some error */
return 1;
}
/* writing 0 bytes is weird, but possible, consider putting
* some code here to cope for that possibility. */
/* n >= 0 */
/* update pointer and numbytes */
p += n;
numbytes -= n;
}
/* if we get here, we have written all numbytes */
Why is the following code slow? And by slow I mean 100x-1000x slow. It just repeatedly performs read/write directly on a TCP socket. The curious part is that it remains slow only if I use two function calls for both read AND write as shown below. If I change either the server or the client code to use a single function call (as in the comments), it becomes super fast.
Code snippet:
int main(...) {
int sock = ...; // open TCP socket
int i;
char buf[100000];
for(i=0;i<2000;++i)
{ if(amServer)
{ write(sock,buf,10);
// read(sock,buf,20);
read(sock,buf,10);
read(sock,buf,10);
}else
{ read(sock,buf,10);
// write(sock,buf,20);
write(sock,buf,10);
write(sock,buf,10);
}
}
close(sock);
}
We stumbled on this in a larger program that was actually using stdio buffering. It mysteriously became sluggish the moment the payload size exceeded the buffer size by a small margin. I did some digging with strace and finally boiled the problem down to this. I can solve it by fooling around with the buffering strategy, but I'd really like to know what on earth is going on here. On my machine, the run time goes from 0.030 s to over a minute (tested both locally and across remote machines) when I change the two read calls to a single call.
These tests were done on various Linux distros, and various kernel versions. Same result.
Fully runnable code with networking boilerplate:
#include <netdb.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/ip.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
static int getsockaddr(const char* name,const char* port, struct sockaddr* res)
{
struct addrinfo* list;
if(getaddrinfo(name,port,NULL,&list) < 0) return -1;
for(;list!=NULL && list->ai_family!=AF_INET;list=list->ai_next);
if(!list) return -1;
memcpy(res,list->ai_addr,list->ai_addrlen);
freeaddrinfo(list);
return 0;
}
// used as sock=tcpConnect(...); ...; close(sock);
static int tcpConnect(struct sockaddr_in* sa)
{
int outsock;
if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
if(connect(outsock,(struct sockaddr*)sa,sizeof(*sa))<0) return -1;
return outsock;
}
int tcpConnectTo(const char* server, const char* port)
{
struct sockaddr_in sa;
if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
int sock=tcpConnect(&sa); if(sock<0) return -1;
return sock;
}
int tcpListenAny(const char* portn)
{
in_port_t port;
int outsock;
if(sscanf(portn,"%hu",&port)<1) return -1;
if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
int reuse = 1;
if(setsockopt(outsock,SOL_SOCKET,SO_REUSEADDR,
(const char*)&reuse,sizeof(reuse))<0) return fprintf(stderr,"setsockopt() failed\n"),-1;
struct sockaddr_in sa = { .sin_family=AF_INET, .sin_port=htons(port)
, .sin_addr={INADDR_ANY} };
if(bind(outsock,(struct sockaddr*)&sa,sizeof(sa))<0) return fprintf(stderr,"Bind failed\n"),-1;
if(listen(outsock,SOMAXCONN)<0) return fprintf(stderr,"Listen failed\n"),-1;
return outsock;
}
int tcpAccept(const char* port)
{
int listenSock, sock;
listenSock = tcpListenAny(port);
if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
close(listenSock);
return sock;
}
void writeLoop(int fd,const char* buf,size_t n)
{
// Don't even bother incrementing buffer pointer
while(n) n-=write(fd,buf,n);
}
void readLoop(int fd,char* buf,size_t n)
{
while(n) n-=read(fd,buf,n);
}
int main(int argc,char* argv[])
{
if(argc<3)
{ fprintf(stderr,"Usage: round {server_addr|--} port\n");
return -1;
}
bool amServer = (strcmp("--",argv[1])==0);
int sock;
if(amServer) sock=tcpAccept(argv[2]);
else sock=tcpConnectTo(argv[1],argv[2]);
if(sock<0) { fprintf(stderr,"Connection failed\n"); return -1; }
int i;
char buf[100000] = { 0 };
for(i=0;i<4000;++i)
{
if(amServer)
{ writeLoop(sock,buf,10);
readLoop(sock,buf,20);
//readLoop(sock,buf,10);
//readLoop(sock,buf,10);
}else
{ readLoop(sock,buf,10);
writeLoop(sock,buf,20);
//writeLoop(sock,buf,10);
//writeLoop(sock,buf,10);
}
}
close(sock);
return 0;
}
EDIT: This version is slightly different from the other snippet in that it reads/writes in a loop. So in this version, two separate writes automatically cause two separate read() calls, even if readLoop is called only once. But otherwise the problem still remains.
Interesting. You are a victim of Nagle's algorithm combined with TCP delayed acknowledgements.
Nagle's algorithm is a mechanism used in TCP to defer transmission of small segments until enough data has accumulated to make it worth building and sending a segment over the network. From the Wikipedia article:
Nagle's algorithm works by combining a number of small outgoing
messages, and sending them all at once. Specifically, as long as there
is a sent packet for which the sender has received no acknowledgment,
the sender should keep buffering its output until it has a full
packet's worth of output, so that output can be sent all at once.
However, TCP typically also employs delayed acknowledgements, a technique that holds back ACK replies and batches them together (TCP ACKs are cumulative) to reduce network traffic.
That Wikipedia article further mentions this:
With both algorithms enabled, applications that do two successive
writes to a TCP connection, followed by a read that will not be
fulfilled until after the data from the second write has reached the
destination, experience a constant delay of up to 500 milliseconds,
the "ACK delay".
(Emphasis mine)
In your specific case, since the server doesn't send more data before reading the reply, the client is causing the delay: if the client writes twice, the second write will be delayed.
If Nagle's algorithm is being used by the sending party, data will be
queued by the sender until an ACK is received. If the sender does not
send enough data to fill the maximum segment size (for example, if it
performs two small writes followed by a blocking read) then the
transfer will pause up to the ACK delay timeout.
So, when the client makes 2 write calls, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
Client issues the second write. The previous write has not been acknowledged, so Nagle's algorithm defers transmission until more data arrives (until enough data has been collected to make a segment) or the previous write is ACKed.
Server is tired of waiting and, once its delayed-ACK timer expires (up to 500 ms), acknowledges the segment.
Client finally completes the 2nd write.
With 1 write, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
The server writes to the socket. An ACK is part of the TCP header, so if you're writing, you might as well acknowledge the previous segment at no extra cost. Do it.
Meanwhile, the client wrote once, so it was already waiting on the next read - there was no 2nd write waiting for the server's ACK.
If you want to keep writing twice on the client side, you need to disable Nagle's algorithm. The algorithm's author himself recommends a user-level solution instead:
The user-level solution is to avoid write-write-read sequences on
sockets. write-read-write-read is fine. write-write-write is fine. But
write-write-read is a killer. So, if you can, buffer up your little
writes to TCP and send them all at once. Using the standard UNIX I/O
package and flushing write before each read usually works.
(See the citation on Wikipedia)
As David Schwartz mentions in the comments, disabling Nagle's algorithm may not be the greatest idea for various reasons, but it illustrates the point and shows that the algorithm is indeed causing the delay.
To disable it, you need to set the TCP_NODELAY option on the sockets with setsockopt(2).
This can be done in tcpConnectTo() for the client:
int tcpConnectTo(const char* server, const char* port)
{
struct sockaddr_in sa;
if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
int sock=tcpConnect(&sa); if(sock<0) return -1;
int val = 1;
if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
perror("setsockopt(2) error");
return sock;
}
And in tcpAccept() for the server:
int tcpAccept(const char* port)
{
int listenSock, sock;
listenSock = tcpListenAny(port);
if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
close(listenSock);
int val = 1;
if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
perror("setsockopt(2) error");
return sock;
}
It's interesting to see the huge difference this makes.
If you'd rather not mess with the socket options, it's enough to ensure that the client writes once - and only once - before the next read. You can still have the server read twice:
for(i=0;i<4000;++i)
{
if(amServer)
{ writeLoop(sock,buf,10);
//readLoop(sock,buf,20);
readLoop(sock,buf,10);
readLoop(sock,buf,10);
}else
{ readLoop(sock,buf,10);
writeLoop(sock,buf,20);
//writeLoop(sock,buf,10);
//writeLoop(sock,buf,10);
}
}
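When the two pieces live in separate buffers, one way to still hand them to TCP as a single write is writev(2). The sketch below is only an illustration; write_pair() is a made-up helper name, not something from the code above.

```c
#include <sys/types.h>
#include <sys/uio.h>

/* Hypothetical helper: submit two buffers with one system call, so TCP
 * sees one write instead of a write-write sequence.
 * Returns the number of bytes written (possibly fewer than
 * alen + blen) or -1 on error, exactly like writev(). */
static ssize_t write_pair(int fd, const void *a, size_t alen,
                          const void *b, size_t blen)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)a, .iov_len = alen },
        { .iov_base = (void *)b, .iov_len = blen },
    };
    return writev(fd, iov, 2);
}
```

On the client side this would replace the two writeLoop(sock,buf,10) calls with a single write_pair(sock, buf, 10, buf, 10). As with write(), a short return value still has to be handled by the caller.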