libuv simple send udp - c

I'm writing a multiplatform shared library in C that sends UDP messages using libuv. However, I don't know much about libuv, and I don't know whether my implementation is good or whether there is another solution besides libuv.
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <uv.h>

#define IP "0.0.0.0"
#define PORT 8090
#define STR_BUFFER 256

void on_send(uv_udp_send_t *req, int status) {
    if (status) {
        fprintf(stderr, "Send error %s\n", uv_strerror(status));
        return;
    }
}

int send_udp(char *msg){
    uv_loop_t *loop = malloc(sizeof(uv_loop_t));
    uv_loop_init(loop);

    uv_udp_t send_socket;
    uv_udp_init(loop, &send_socket);

    struct sockaddr_in send_addr;
    uv_ip4_addr(IP, PORT, &send_addr);
    uv_udp_bind(&send_socket, (const struct sockaddr*)&send_addr, 0);

    char buff[STR_BUFFER];
    memset(buff, 0, STR_BUFFER);
    strcpy(buff, msg);

    uv_buf_t buffer = uv_buf_init(buff, STR_BUFFER);

    uv_udp_send_t send_req;
    uv_udp_send(&send_req, &send_socket, &buffer, 1, (const struct sockaddr*)&send_addr, on_send);

    uv_run(loop, UV_RUN_ONCE);
    uv_loop_close(loop);
    free(loop);
    return 0;
}

int main() {
    send_udp("test 123\n");
    return 0;
}

Your implementation has multiple issues:
I'm not sure a single loop iteration is enough to send a UDP message on every platform. This is something you can check easily with the value returned by uv_run; see the documentation for uv_run in UV_RUN_ONCE mode:
UV_RUN_ONCE: Poll for i/o once. Note that this function blocks if there are no pending callbacks. Returns zero when done (no active handles or requests left), or non-zero if more callbacks are expected (meaning you should run the event loop again sometime in the future).
If you keep your code as-is, I would suggest at least this:
int done;
do {
    done = uv_run(loop, UV_RUN_ONCE);
} while (done != 0);
But keep on reading, you can do even better! :)
It's quite costly in terms of performance: uv_loops are meant to be long-lived, not created anew for each message sent.
Incomplete error handling: uv_udp_bind, uv_udp_send, ... they can all fail!
How to improve
I would suggest you to change your code for one of the two following solutions:
Your library is used in a libuv context (i.e., you don't try to hide the libuv implementation detail, but require everyone who wishes to use your library to use libuv explicitly).
You could then change your function signature to something like int send_udp(uv_loop_t *loop, char *msg) and let the library users manage the event loop and run it, along these lines:
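A minimal sketch of that approach (untested; for brevity it allocates one uv_udp_t per send and frees everything from the callbacks; a long-lived handle owned by the caller would work just as well):

typedef struct {
    uv_udp_send_t req;  /* must outlive the send */
    uv_udp_t handle;    /* one handle per send, for simplicity */
    char *data;         /* heap copy of the message */
} send_ctx_t;

static void on_closed(uv_handle_t *h) {
    send_ctx_t *ctx = h->data;
    free(ctx->data);
    free(ctx);
}

static void on_sent(uv_udp_send_t *req, int status) {
    send_ctx_t *ctx = req->data;
    if (status)
        fprintf(stderr, "Send error %s\n", uv_strerror(status));
    uv_close((uv_handle_t *)&ctx->handle, on_closed);
}

int send_udp(uv_loop_t *loop, const char *msg) {
    send_ctx_t *ctx = malloc(sizeof(*ctx));
    if (!ctx) return UV_ENOMEM;
    ctx->data = strdup(msg);
    ctx->req.data = ctx;
    ctx->handle.data = ctx;

    int rc = uv_udp_init(loop, &ctx->handle);
    if (rc) { free(ctx->data); free(ctx); return rc; }

    struct sockaddr_in addr;
    uv_ip4_addr(IP, PORT, &addr);

    uv_buf_t buf = uv_buf_init(ctx->data, (unsigned)strlen(ctx->data));
    rc = uv_udp_send(&ctx->req, &ctx->handle, &buf, 1,
                     (const struct sockaddr *)&addr, on_sent);
    if (rc) uv_close((uv_handle_t *)&ctx->handle, on_closed);
    return rc; /* the caller runs the loop; on_sent fires from there */
}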
Your library uses libuv as an implementation detail: you don't want to bother your library users with libuv, so it's your responsibility to provide robust and performant code. This is how I would do it:
mylib_init: starts a thread and runs a uv_loop on it
send_udp: pushes the message onto a queue (beware of thread-safety) and notifies your loop that it has a message to send (you can use uv_async for this); you can then send the message with approximately the same code you are already using.
mylib_shutdown: stops the loop and the thread (again, you can use a uv_async to call uv_stop from the right thread)
It would look like this (I don't have a compiler at hand to test, but this should give you most of the work):
static uv_thread_t thread;            // our network thread
static uv_loop_t loop;                // the loop running on the thread
static uv_async_t notify_send;        // to notify the thread it has messages to send
static uv_async_t notify_shutdown;    // to notify the thread it must shut down
static queue_t buffer_queue;          // a queue of messages to send
static uv_mutex_t buffer_queue_mutex; // to sync access to the queue from the various threads

static void thread_entry(void *arg);
static void on_send_messages(uv_async_t *handle);
static void on_shutdown(uv_async_t *handle);

int mylib_init() {
    // init the mutex before the thread starts, so send_udp cannot race it
    uv_mutex_init_recursive(&buffer_queue_mutex);
    // will call thread_entry on a new thread, our network thread
    return uv_thread_create(&thread, thread_entry, NULL);
}

int send_udp(char *msg) {
    uv_mutex_lock(&buffer_queue_mutex);
    queue_enqueue(&buffer_queue, strdup(msg)); // don't forget to free() after sending the message
    uv_async_send(&notify_send);
    uv_mutex_unlock(&buffer_queue_mutex);
    return 0;
}

int mylib_shutdown() {
    // will call on_shutdown on the loop thread
    uv_async_send(&notify_shutdown);
    // wait for the thread to stop
    int rc = uv_thread_join(&thread);
    uv_mutex_destroy(&buffer_queue_mutex);
    return rc;
}

static void thread_entry(void *arg) {
    uv_loop_init(&loop);
    uv_async_init(&loop, &notify_send, on_send_messages);
    uv_async_init(&loop, &notify_shutdown, on_shutdown);
    uv_run(&loop, UV_RUN_DEFAULT); // this call will not return until uv_stop is called
    uv_loop_close(&loop);
}

static void on_send_messages(uv_async_t *handle) {
    uv_mutex_lock(&buffer_queue_mutex);
    char *msg = NULL;
    // for each member of the queue ...
    while (queue_dequeue(&buffer_queue, &msg) == 0) {
        // create a uv_udp_t, send the message, free(msg) when done
    }
    uv_mutex_unlock(&buffer_queue_mutex);
}

static void on_shutdown(uv_async_t *handle) {
    // close our handles so uv_loop_close() can succeed once the loop stops
    uv_close((uv_handle_t *)&notify_send, NULL);
    uv_close((uv_handle_t *)&notify_shutdown, NULL);
    uv_stop(&loop);
}
It's up to you to develop or find a queue implementation ;)
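For completeness, here is a minimal sketch of what that queue_t could look like (a plain singly linked list; queue_enqueue/queue_dequeue are the hypothetical names used in the snippet above, with queue_dequeue returning 0 on success):

typedef struct queue_node {
    char *msg;
    struct queue_node *next;
} queue_node_t;

typedef struct {
    queue_node_t *head, *tail;
} queue_t;

static void queue_enqueue(queue_t *q, char *msg) {
    queue_node_t *n = malloc(sizeof(*n));
    n->msg = msg;
    n->next = NULL;
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
}

static int queue_dequeue(queue_t *q, char **msg) {
    if (!q->head) return -1; // empty: matches the `== 0` success check above
    queue_node_t *n = q->head;
    q->head = n->next;
    if (!q->head) q->tail = NULL;
    *msg = n->msg;
    free(n);
    return 0;
}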
Usage
int main() {
    mylib_init();
    send_udp("my super message");
    mylib_shutdown();
}

Related

Reading global flag does not work for CPU>GPU data exchange in CUDA

I am trying to implement simple unidirectional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false, and a CPU/host thread can set it to true during ongoing computation; the GPU/kernel threads should then exit.
This is what I tried; the code is simplified. I removed error checking and application logic (including the logic that prevents concurrent access to cancelRequested).
On the host side, global definition (.cpp):
// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;
On the host side in the compute thread (.cpp):
initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);
On the host side in a main thread (.cpp):
cancel(cancelRequested); // Called after init is finished
Host routines (.cu file):
void initialize(volatile bool** pCancelRequested)
{
    cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
    const bool aFalse = false;
    cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}

void compute(volatile bool* pCancelRequested)
{
    ....
    computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
    cudaDeviceSynchronize(); // Non-busy wait
    ....
}

void finalize(volatile bool** pCancelRequested)
{
    cudaFree(*const_cast<bool**>(pCancelRequested));
    *pCancelRequested = nullptr;
}

void cancel(volatile bool* pCancelRequested)
{
    const bool aTrue = true;
    cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}
Device routines (.cu file):
__global__ void computeKernel(volatile bool* pCancelRequested)
{
    while (someCondition)
    {
        // Computation step here
        if (*pCancelRequested)
        {
            printf("-> Cancel requested!\n");
            return;
        }
    }
}
The code runs fine, but it never enters the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb, i.e. writing to the global flag works fine, at least from the host's point of view. However, the kernels never see the cancel flag set to true and exit normally from the outer while loop.
Any idea why this doesn't work?
The fundamental problem I see with your approach is that CUDA streams will prevent it from working.
CUDA streams have two basic principles:
Items issued into the same stream will not overlap; they will serialize.
Items issued into separate created streams have the possibility to overlap; there is no defined ordering of those operations provided by CUDA.
Even if you don't explicitly use streams, you are operating in the "default stream" and the same stream semantics apply.
I'm not covering everything there is to know about streams in this brief summary. You can learn more about CUDA streams in unit 7 of this online training series.
Because of CUDA streams, this:
computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
and this:
cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
could not possibly overlap (they are being issued into the same "default" CUDA stream, and so rule 1 above says that they cannot possibly overlap). But overlap is essential if you want to "signal" the running kernel. We must allow the cudaMemcpy operation to take place at the same time that the kernel is running.
We can fix this via a direct application of CUDA streams (taking note of rule 2 above), to put the copy operation and the compute (kernel) operation into separate created streams, so as to allow them to overlap. When we do that, things work as desired:
$ cat t2184.cu
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){
    while (*flag != 0);
}

int main(){
    int *flag, *h_flag = new int;
    cudaStream_t s[2];
    cudaStreamCreate(s+0);
    cudaStreamCreate(s+1);
    cudaMalloc(&flag, sizeof(h_flag[0]));
    *h_flag = 1;
    cudaMemcpy(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice);
    k<<<32, 256, 0, s[0]>>>(flag);
    sleep(5);
    *h_flag = 0;
    cudaMemcpyAsync(flag, h_flag, sizeof(h_flag[0]), cudaMemcpyHostToDevice, s[1]);
    cudaDeviceSynchronize();
}
$ nvcc -o t2184 t2184.cu
$ compute-sanitizer ./t2184
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
$
NOTES:
Although not evident from the static text printout, the program spends approximately 5 seconds before exiting. If you comment out a line such as *h_flag = 0; then the program will hang, indicating that the flag signal method is working correctly.
Note the use of volatile. This is necessary to instruct the compiler that any access to that data must be an actual access; the compiler is not allowed to make modifications that would prevent a memory read or write from occurring at the expected location.
This kind of host->device signal behavior can also be realized without explicit use of streams, but with host pinned memory as the signalling location, since it is "visible" to both host and device code, "simultaneously". Here is an example:
#include <iostream>
#include <unistd.h>

__global__ void k(volatile int *flag){
    while (*flag != 0);
}

int main(){
    int *flag;
    cudaHostAlloc(&flag, sizeof(flag[0]), cudaHostAllocDefault);
    *flag = 1;
    k<<<32, 256>>>(flag);
    sleep(5);
    *flag = 0;
    cudaDeviceSynchronize();
}
For other examples of signalling, such as from device to host, other readers may be interested in this.
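As a rough illustration of the reverse direction, here is a sketch (untested, my own construction rather than anything from the linked material) of a device-to-host signal using the same pinned-memory idea: the kernel sets a flag that the host polls.

__global__ void worker(volatile int *done){
    // ... do work ...
    *done = 1;                 // signal the host
    __threadfence_system();    // make the write visible across the PCIe bus
}

int main(){
    volatile int *done;
    cudaHostAlloc((void **)&done, sizeof(*done), cudaHostAllocDefault);
    *done = 0;
    worker<<<1, 1>>>((int *)done);
    while (*done == 0) { }     // host busy-waits on the pinned flag
    cudaDeviceSynchronize();
}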

Callbacks in AIO asynchronous I/O

I have found discussion on using callbacks in AIO asynchronous I/O on the internet, but what I have found has left me confused. An example code is listed below, from a site on Linux AIO, in which AIO is used to read in the contents of a file. My problem is that it seems to me that code which actually processes the contents of that file must at some point block execution until the read is complete, yet this code has no such block at all. I was expecting to see some kind of call analogous to pthread_mutex_lock in pthread programming. I suppose I could put a dummy loop after the aio_read() call that would block execution until the read is completed, but that puts me right back at the simplest way of blocking, and then I don't see what is gained by all the coding overhead that goes into establishing a callback. I am obviously missing something. Could someone tell me what it is?
Here is the code. (BTW, the original is in C++; I have adapted it to C.)
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <aio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>

const int BUFSIZE = 1024;

void aio_completion_handler(sigval_t sigval)
{
    struct aiocb *req;
    req = (struct aiocb *)sigval.sival_ptr; // Pay attention here.

    /* Check again whether the asynchronous request is complete. */
    if (aio_error(req) == 0)
    {
        ssize_t ret = aio_return(req);
        printf("ret == %zd\n", ret);
        printf("%s\n", (char *)req->aio_buf);
    }
    close(req->aio_fildes);
    free((void *)req->aio_buf);
    while (1)
    {
        printf("The callback function is being executed...\n");
        sleep(1);
    }
}

int main(void)
{
    struct aiocb my_aiocb;

    int fd = open("file.txt", O_RDONLY);
    if (fd < 0)
        perror("open");

    bzero((char *)&my_aiocb, sizeof(my_aiocb));
    my_aiocb.aio_buf = malloc(BUFSIZE);
    if (!my_aiocb.aio_buf)
        perror("my_aiocb.aio_buf");

    my_aiocb.aio_fildes = fd;
    my_aiocb.aio_nbytes = BUFSIZE;
    my_aiocb.aio_offset = 0;

    // Fill in callback information
    /*
       Using SIGEV_THREAD to request a thread callback function
       as a notification method
    */
    my_aiocb.aio_sigevent.sigev_notify = SIGEV_THREAD;
    my_aiocb.aio_sigevent.sigev_notify_function = aio_completion_handler;
    my_aiocb.aio_sigevent.sigev_notify_attributes = NULL;
    /*
       The context to be transmitted is loaded into the handler (in this
       case, a reference to the aiocb request itself). In this handler, we
       simply refer to the arrived sigval pointer and use the AIO function
       to verify that the request has been completed.
    */
    my_aiocb.aio_sigevent.sigev_value.sival_ptr = &my_aiocb;

    int ret = aio_read(&my_aiocb);
    if (ret < 0)
        perror("aio_read");

    /* <---- A real code would process the data read from the file.
     * So execution needs to be blocked until it is clear that the
     * read is complete. Right here I could put in:
     *     while (aio_error(&my_aiocb) == EINPROGRESS) {}
     * But is there some other way involving a callback?
     * If not, what has creating a callback done for me?
     */

    // The calling process continues to execute
    while (1)
    {
        printf("The main thread continues to execute...\n");
        sleep(1);
    }
    return 0;
}
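For what it's worth, POSIX does provide a dedicated blocking call for exactly this situation: aio_suspend(3) sleeps until one of the listed requests completes, avoiding a busy-wait like the one sketched in the comment above. A minimal sketch:

/* Block until the request completes, without spinning. */
const struct aiocb *list[1] = { &my_aiocb };
if (aio_suspend(list, 1, NULL) == 0)        /* NULL timeout: wait indefinitely */
{
    if (aio_error(&my_aiocb) == 0)
    {
        ssize_t n = aio_return(&my_aiocb);  /* number of bytes read */
        /* ... process my_aiocb.aio_buf here ... */
    }
}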

How could I send a message to another program, and output that it has been received?

In Contiki, I need to have two files, sender and receiver; the sender sends packets to the receiver. My problem is that the receiver is not outputting that the packets have been received.
I tried a while loop inside the receiver, and I even tried to create a function, but nothing has worked.
My sender.c file
#include "contiki.h"
#include "net/rime.h"
#include "random.h"
#include "dev/button-sensor.h"
#include "dev/leds.h"
#include <stdio.h>
PROCESS(sendReceive, "Hello There");
AUTOSTART_PROCESSES(&sendReceive);
PROCESS_THREAD(sendReceive, ev, data)
{
PROCESS_BEGIN();
static struct abc_conn abc;
static struct etimer et;
static const struct abc_callbacks abc_call;
PROCESS_EXITHANDLER(abc_close(&abc);)
abc_open(&abc, 128, &abc_call);
while(1)
{
/* Delay 2-4 seconds */
etimer_set(&et, CLOCK_SECOND * 2 + random_rand() % (CLOCK_SECOND * 2));
PROCESS_WAIT_EVENT_UNTIL(etimer_expired(&et));
packetbuf_copyfrom("Hello", 6);
abc_send(&abc);
printf("Message sent\n");
}
PROCESS_END();
}
my receiver.c file
#include "contiki.h"
#include "net/rime.h"
#include "random.h"
#include "dev/button-sensor.h"
#include "dev/leds.h"
#include <stdio.h>
PROCESS(sendReceive, "Receiving Message");
AUTOSTART_PROCESSES(&sendReceive);
PROCESS_THREAD(sendReceive, ev, data)
{
PROCESS_BEGIN();
{
printf("Message received '%s'\n", (char *)packetbuf_dataptr());
}
PROCESS_END();
}
The sender.c file is working, it is sending the packets correctly; the problem is that the receiver does not output that anything has been received.
While sending is simple - you just need to call a function - receiving data in an embedded system is in general more complicated. There needs to be a way for the operating system to let your code know that new data has arrived from outside. In Contiki that is done internally with events, and from the user's perspective with callbacks.
So, implement a callback function:
static void
recv_from_abc(struct abc_conn *bc)
{
  printf("Message received '%s'\n", (char *)packetbuf_dataptr());
}
In your receiver process, create and open a connection, passing the callback structure (which holds the function pointer) as a parameter:
static struct abc_conn c;
static const struct abc_callbacks callbacks =
  {recv_from_abc, NULL};

uint16_t channel = 128; /* matching the sender code */
abc_open(&c, channel, &callbacks);
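Putting it together, the whole receiver could look like this (a sketch along the lines of the question's code, untested):

#include "contiki.h"
#include "net/rime.h"
#include <stdio.h>

static void
recv_from_abc(struct abc_conn *bc)
{
  printf("Message received '%s'\n", (char *)packetbuf_dataptr());
}

static struct abc_conn c;
static const struct abc_callbacks callbacks = {recv_from_abc, NULL};

PROCESS(sendReceive, "Receiving Message");
AUTOSTART_PROCESSES(&sendReceive);

PROCESS_THREAD(sendReceive, ev, data)
{
  PROCESS_EXITHANDLER(abc_close(&c);)
  PROCESS_BEGIN();

  abc_open(&c, 128, &callbacks); /* channel 128 matches the sender */

  /* Nothing left to do here: packets are delivered via the callback.
     Just keep the process alive. */
  while(1) {
    PROCESS_WAIT_EVENT();
  }

  PROCESS_END();
}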

How to implement a message queue in standard C

I have a project (an STM32 microcontroller, programmed in C) where I need to receive messages from a serial port (strings, for example) and put them in a queue, from which I will read the strings later.
Can someone tell me where I can find an example of how to create a message queue (like a FIFO) of strings (or byte arrays) using standard C, and how to manage the queue? Thanks for any kind of support.
"example on how to create e message queue (like FIFO) of strings (or byte array) using standard C and how to manage the queue"
"in a micro controller with a standard C you should manage the buffers, create the queue, enqueue and dequeue the elements"
The example given below should meet the requirements.
If necessary, the library functions used can easily be replaced with platform-specific versions or standard C array operations.
The memory allocation for the queue can also be done as a static variable instead of a stack variable. If desired, even malloc could be used.
The message type can easily be extended. The queue and data sizes are defined as constants.
@leonardo gave a good hint on how to structure the processing, i.e. enqueuing messages in an interrupt routine and dequeuing them in main. I guess that some kind of semaphore needs to be used so that the execution of the functions which manipulate the queue doesn't get mixed up; see the sketch after the example below. Some thoughts on this are discussed in semaphore like synchronization in ISR (Interrupt service routine)
/*
    Portable array-based cyclic FIFO queue.
*/
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MESSAGE_SIZE 64
#define QUEUE_SIZE 3

typedef struct {
    char data[MESSAGE_SIZE];
} MESSAGE;

typedef struct {
    MESSAGE messages[QUEUE_SIZE];
    int begin;
    int end;
    int current_load;
} QUEUE;

void init_queue(QUEUE *queue) {
    queue->begin = 0;
    queue->end = 0;
    queue->current_load = 0;
    memset(&queue->messages[0], 0, QUEUE_SIZE * sizeof(MESSAGE));
}

bool enque(QUEUE *queue, MESSAGE *message) {
    if (queue->current_load < QUEUE_SIZE) {
        if (queue->end == QUEUE_SIZE) {
            queue->end = 0;
        }
        queue->messages[queue->end] = *message;
        queue->end++;
        queue->current_load++;
        return true;
    } else {
        return false;
    }
}

bool deque(QUEUE *queue, MESSAGE *message) {
    if (queue->current_load > 0) {
        *message = queue->messages[queue->begin];
        memset(&queue->messages[queue->begin], 0, sizeof(MESSAGE));
        queue->begin = (queue->begin + 1) % QUEUE_SIZE;
        queue->current_load--;
        return true;
    } else {
        return false;
    }
}

int main(int argc, char** argv) {
    QUEUE queue;
    init_queue(&queue);

    MESSAGE message1 = {"This is"};
    MESSAGE message2 = {"a simple"};
    MESSAGE message3 = {"queue!"};

    enque(&queue, &message1);
    enque(&queue, &message2);
    enque(&queue, &message3);

    MESSAGE rec;
    while (deque(&queue, &rec)) {
        printf("%s\n", &rec.data[0]);
    }
}
Compiling and running:
$ gcc -Wall queue.c
$ ./a.out
This is
a simple
queue!
$
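As a sketch of the ISR-safety point mentioned above (assuming a single-core Cortex-M such as an STM32; __disable_irq(), __get_PRIMASK() and __set_PRIMASK() are CMSIS intrinsics, and the wrapper name below is only illustrative):

/* Dequeue from the main loop without being interrupted by the UART ISR
   that enqueues incoming messages. The ISR itself may call enque()
   directly, since main-line code cannot preempt it. */
bool deque_protected(QUEUE *queue, MESSAGE *message) {
    uint32_t primask = __get_PRIMASK(); /* remember the current interrupt mask */
    __disable_irq();                    /* enter critical section */
    bool ok = deque(queue, message);
    __set_PRIMASK(primask);             /* restore, instead of blindly re-enabling */
    return ok;
}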
The C language does not have a queue built in (it's a batteries-excluded language); you need to build your own. If you just need a FIFO to push things in your interrupt routine and then pop them in your main loop (which is good design, BTW), check A Simple Message Queue for C to see if it works for you.

Why does TCP socket slow down if done in multiple system calls?

Why is the following code slow? And by slow I mean 100x-1000x slow. It just repeatedly performs read/write directly on a TCP socket. The curious part is that it is slow only if I use two function calls for both read AND write, as shown below. If I change either the server or the client code to use a single function call (as in the comments), it becomes super fast.
Code snippet:
int main(...) {
    int sock = ...; // open TCP socket
    int i;
    char buf[100000];
    for(i=0;i<2000;++i)
    { if(amServer)
      { write(sock,buf,10);
        // read(sock,buf,20);
        read(sock,buf,10);
        read(sock,buf,10);
      }else
      { read(sock,buf,10);
        // write(sock,buf,20);
        write(sock,buf,10);
        write(sock,buf,10);
      }
    }
    close(sock);
}
We stumbled on this in a larger program that was actually using stdio buffering. It mysteriously became sluggish the moment the payload size exceeded the buffer size by a small margin. Then I did some digging around with strace and finally boiled the problem down to this. I can solve it by fooling around with the buffering strategy, but I'd really like to know what on earth is going on here. On my machine, it goes from 0.030 s to over a minute (tested both locally and against remote machines) when I change the two read calls to a single call.
These tests were done on various Linux distros, and various kernel versions. Same result.
Fully runnable code with networking boilerplate:
#include <netdb.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/ip.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

static int getsockaddr(const char* name,const char* port, struct sockaddr* res)
{
    struct addrinfo* list;
    // getaddrinfo returns nonzero (not necessarily negative) on error
    if(getaddrinfo(name,port,NULL,&list) != 0) return -1;
    for(;list!=NULL && list->ai_family!=AF_INET;list=list->ai_next);
    if(!list) return -1;
    memcpy(res,list->ai_addr,list->ai_addrlen);
    freeaddrinfo(list);
    return 0;
}
// used as sock=tcpConnect(...); ...; close(sock);
static int tcpConnect(struct sockaddr_in* sa)
{
    int outsock;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    if(connect(outsock,(struct sockaddr*)sa,sizeof(*sa))<0) return -1;
    return outsock;
}
int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;
    return sock;
}
int tcpListenAny(const char* portn)
{
    in_port_t port;
    int outsock;
    if(sscanf(portn,"%hu",&port)<1) return -1;
    if((outsock=socket(AF_INET,SOCK_STREAM,0))<0) return -1;
    int reuse = 1;
    if(setsockopt(outsock,SOL_SOCKET,SO_REUSEADDR,
                  (const char*)&reuse,sizeof(reuse))<0) return fprintf(stderr,"setsockopt() failed\n"),-1;
    struct sockaddr_in sa = { .sin_family=AF_INET, .sin_port=htons(port)
                            , .sin_addr={INADDR_ANY} };
    if(bind(outsock,(struct sockaddr*)&sa,sizeof(sa))<0) return fprintf(stderr,"Bind failed\n"),-1;
    if(listen(outsock,SOMAXCONN)<0) return fprintf(stderr,"Listen failed\n"),-1;
    return outsock;
}
int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);
    return sock;
}
void writeLoop(int fd,const char* buf,size_t n)
{
    // Don't even bother incrementing buffer pointer
    while(n) n-=write(fd,buf,n);
}
void readLoop(int fd,char* buf,size_t n)
{
    while(n) n-=read(fd,buf,n);
}
int main(int argc,char* argv[])
{
    if(argc<3)
    { fprintf(stderr,"Usage: round {server_addr|--} port\n");
      return -1;
    }
    bool amServer = (strcmp("--",argv[1])==0);
    int sock;
    if(amServer) sock=tcpAccept(argv[2]);
    else sock=tcpConnectTo(argv[1],argv[2]);
    if(sock<0) { fprintf(stderr,"Connection failed\n"); return -1; }
    int i;
    char buf[100000] = { 0 };
    for(i=0;i<4000;++i)
    {
        if(amServer)
        { writeLoop(sock,buf,10);
          readLoop(sock,buf,20);
          //readLoop(sock,buf,10);
          //readLoop(sock,buf,10);
        }else
        { readLoop(sock,buf,10);
          writeLoop(sock,buf,20);
          //writeLoop(sock,buf,10);
          //writeLoop(sock,buf,10);
        }
    }
    close(sock);
    return 0;
}
EDIT: This version is slightly different from the other snippet in that it reads/writes in a loop. So in this version, two separate writes automatically cause two separate read() calls, even if readLoop is called only once. But otherwise the problem still remains.
Interesting. You are a victim of Nagle's algorithm together with TCP delayed acknowledgements.
Nagle's algorithm is a mechanism used in TCP to defer transmission of small segments until enough data has accumulated to make it worth building and sending a segment over the network. From the Wikipedia article:
Nagle's algorithm works by combining a number of small outgoing
messages, and sending them all at once. Specifically, as long as there
is a sent packet for which the sender has received no acknowledgment,
the sender should keep buffering its output until it has a full
packet's worth of output, so that output can be sent all at once.
However, TCP typically employs something known as TCP delayed acknowledgements, a technique that consists of accumulating a batch of ACK replies (because TCP uses cumulative ACKs) in order to reduce network traffic.
That wikipedia article further mentions this:
With both algorithms enabled, applications that do two successive
writes to a TCP connection, followed by a read that will not be
fulfilled until after the data from the second write has reached the
destination, experience a constant delay of up to 500 milliseconds,
the "ACK delay".
(Emphasis mine)
In your specific case, since the server doesn't send more data before reading the reply, the client is causing the delay: if the client writes twice, the second write will be delayed.
If Nagle's algorithm is being used by the sending party, data will be
queued by the sender until an ACK is received. If the sender does not
send enough data to fill the maximum segment size (for example, if it
performs two small writes followed by a blocking read) then the
transfer will pause up to the ACK delay timeout.
So, when the client makes 2 write calls, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
Client issues the second write. The previous write has not been acknowledged, so Nagle's algorithm defers transmission until more data arrives (until enough data has been collected to make a segment) or the previous write is ACKed.
Server is tired of waiting and after 500 ms acknowledges the segment.
Client finally completes the 2nd write.
With 1 write, this is what happens:
Client issues the first write.
The server receives some data. It doesn't acknowledge it in the hope that more data will arrive (so it can batch up a bunch of ACKs in one single ACK).
The server writes to the socket. An ACK is part of the TCP header, so if you're writing, you might as well acknowledge the previous segment at no extra cost. Do it.
Meanwhile, the client wrote once, so it was already waiting on the next read - there was no 2nd write waiting for the server's ACK.
If you want to keep writing twice on the client side, you need to disable Nagle's algorithm. This is the solution proposed by the algorithm's author himself:
The user-level solution is to avoid write-write-read sequences on
sockets. write-read-write-read is fine. write-write-write is fine. But
write-write-read is a killer. So, if you can, buffer up your little
writes to TCP and send them all at once. Using the standard UNIX I/O
package and flushing write before each read usually works.
(See the citation on Wikipedia)
As mentioned by David Schwartz in the comments, this may not be the greatest idea for various reasons, but it illustrates the point and shows that this is indeed causing the delay.
To disable it, you need to set the TCP_NODELAY option on the sockets with setsockopt(2).
This can be done in tcpConnectTo() for the client:
int tcpConnectTo(const char* server, const char* port)
{
    struct sockaddr_in sa;
    if(getsockaddr(server,port,(struct sockaddr*)&sa)<0) return -1;
    int sock=tcpConnect(&sa); if(sock<0) return -1;
    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");
    return sock;
}
And in tcpAccept() for the server:
int tcpAccept(const char* port)
{
    int listenSock, sock;
    listenSock = tcpListenAny(port);
    if((sock=accept(listenSock,0,0))<0) return fprintf(stderr,"Accept failed\n"),-1;
    close(listenSock);
    int val = 1;
    if (setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &val, sizeof(val)) < 0)
        perror("setsockopt(2) error");
    return sock;
}
It's interesting to see the huge difference this makes.
If you'd rather not mess with the socket options, it's enough to ensure that the client writes once - and only once - before the next read. You can still have the server read twice:
for(i=0;i<4000;++i)
{
    if(amServer)
    { writeLoop(sock,buf,10);
      //readLoop(sock,buf,20);
      readLoop(sock,buf,10);
      readLoop(sock,buf,10);
    }else
    { readLoop(sock,buf,10);
      writeLoop(sock,buf,20);
      //writeLoop(sock,buf,10);
      //writeLoop(sock,buf,10);
    }
}
