Writing framed data without extra write() cost in C

So I'm sending data on a TCP socket, prefixed with the size of the data, like so:
write(socket, &length, sizeof(length));
write(socket, data, length);
(Note: I have wrapper writen() functions as described in the Unix Network Programming book, and I am checking for errors, etc. The above is simplified for this question.)
Now, my experience is that breaking up data into multiple writes can cause significant slowdown. I have had success speeding things up by creating my own buffer, then sending out one big chunk.
However, in the above case the data may be very large (let's say 1 GB). I don't want to allocate a 1 GB + 4 byte buffer just to be able to make a single write() call. Is there any way of doing something akin to:
write(socket, &length, data, sizeof(length) + length)
without paying the price of a large memory allocation ahead of time? I suppose I could just pre-allocate a chunk the size of write's buffer and send it repeatedly, roughly like this (assume buffer_size >= 4 and a min() helper):
uint32_t payload_len = length;     /* the header holds the payload size only */
length += 4;                       /* total frame size: header + payload */
char chunk[buffer_size];
size_t total = 0;
while (total < length)
{
    size_t fill = 0;
    if (total == 0)                /* the first chunk carries the header */
    {
        memcpy(chunk, &payload_len, 4);
        fill = 4;
    }
    size_t n = min(buffer_size - fill, length - total - fill);
    memcpy(chunk + fill, data + (total + fill) - 4, n);
    write(sock, chunk, fill + n);
    total += fill + n;
}
But in that case I don't know what write's buffer size actually is (is there an API to get it?), and I don't know if this is an appropriate solution.

There is an option for this already. It informs your network layer that you are going to send more data and that you want it to buffer rather than send as soon as possible.
setsockopt(sock_descriptor, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
val is an int and should be 0 or 1. With the "cork" on, the network layer will buffer data as much as possible and send only full packets. You "pop the cork" and then "cork" again to delimit the next batch of transmissions you need to make on the socket.
Your idea is correct, this just saves you the trouble of implementing it, since it's already done in the network stack.
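For example, a minimal sketch of corking around the two writes (Linux-specific; error checking omitted, and send_framed is just an illustrative name):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <stdint.h>
#include <unistd.h>

static void send_framed(int sock, const void *data, uint32_t length)
{
    int one = 1, zero = 0;

    /* Cork: hold partial frames in the kernel until we uncork. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &one, sizeof(one));

    write(sock, &length, sizeof(length));   /* small header write */
    write(sock, data, length);              /* payload write(s) */

    /* Uncork: flush whatever is still buffered right away. */
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &zero, sizeof(zero));
}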

I suggest having a look at writev() (see man writev for full details).
This allows you to send multiple buffers with just one call. As a simple example, sending two chunks in one go (one for the length, one for the data):
#include <sys/uio.h>

struct iovec bits[2];
/* First chunk is the length */
bits[0].iov_base = &length;
bits[0].iov_len = sizeof(length);
/* Second chunk is the payload */
bits[1].iov_base = data;
bits[1].iov_len = length;
/* Send two chunks at once */
writev(socket, bits, 2);
It can get more complicated if you need to use a variable number of chunks (you may need to allocate the array of struct iovec dynamically), but the advantage is that, if your chunks are large, you can avoid copying them, and just manipulate pointer/length pairs, which are much smaller.
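One wrinkle worth noting: like write(), writev() can return a short count, so large payloads still need a retry loop. A minimal sketch, with writev_all being an illustrative name of my own:
#include <sys/uio.h>
#include <unistd.h>

/* Retry until every iovec is fully written. Note that this
   modifies the caller's iovec array as it goes. */
static int writev_all(int fd, struct iovec *iov, int iovcnt)
{
    while (iovcnt > 0) {
        ssize_t n = writev(fd, iov, iovcnt);
        if (n < 0)
            return -1;
        /* Skip over the iovecs that were written completely... */
        while (iovcnt > 0 && (size_t)n >= iov->iov_len) {
            n -= (ssize_t)iov->iov_len;
            iov++;
            iovcnt--;
        }
        /* ...then trim the partially-written one, if any. */
        if (iovcnt > 0) {
            iov->iov_base = (char *)iov->iov_base + n;
            iov->iov_len -= (size_t)n;
        }
    }
    return 0;
}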

I think you are on the right track with the spooled solution you presented. I think buffer_size should be larger than the buffer used internally by the network stack. This way, you minimize the amount of per-write overhead without having to allocate a giant buffer. In other words, by giving the underlying network subsystem more data than it can handle at once, it is free to run at its fastest speed, spending most of its time moving data rather than waiting for more data to be provided.
The optimal buffer_size value might vary from system to system. I would start with 1MB and do some experiments up and down from there to see what works best. There might also be values you can extract and adjust with a sysctl call for the current internal buffer size used on your system. Read this for a suggested technique. You might also use something like getsockopt(..., SO_MAX_MSG_SIZE, ...), as explained here.
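As a concrete starting point for that experiment, a minimal sketch of querying the kernel's current send buffer size with getsockopt(SO_SNDBUF) (error checking omitted; send_buffer_size is an illustrative name):
#include <sys/socket.h>

static int send_buffer_size(int sock)
{
    int sndbuf = 0;
    socklen_t len = sizeof(sndbuf);
    getsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sndbuf, &len);
    return sndbuf;   /* note: Linux reports twice the value that was set */
}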
An IP packet can range up to about 64K in size, though an ordinary Ethernet frame carries only around 1500 bytes of it at a time, so anything larger than 64K per write is certainly sufficient. Read about maximum transmission unit (MTU) sizes to get a sense of what the lowest layers of the network stack are doing, and don't forget that the MTU varies with the network interface, not the process or kernel.
Beware that the MTU can vary along the route from your server to the data's destination. You can use ifconfig or traceroute/tracepath to discover it. With networking, every link in the chain is weak. ;)

Related

Dynamically allocate enough memory on client and server receive part

I wish to dynamically allocate enough memory on the server's and client's receiving side when I send a packet. I.e. if I send a packet of 512 bytes from server to client, I want the client side's char *receive_data to allocate the corresponding amount of memory for the packet.
Here's some pseudo code:
I was thinking of having a counter which loops 2 times; on the first pass it'll send the client/server the length of the packet, and on the second it'll send the packet itself.
#include <stdint.h>
#include <unistd.h>

int write_all(int socket, void *buffer, size_t length)
{
    ssize_t ret = 0;
    int counter = 0;
    uint32_t len = (uint32_t)length;

    while (counter != 2) {
        if (counter == 0) {
            ret = write(socket, &len, sizeof(len));  /* send the length first */
        } else {
            ret = write(socket, buffer, length);     /* then the payload */
        }
        if (ret < 1) break;
        counter++;
    }
    return ret == -1 ? -1 : 0;
}
Is this a good or bad way to do this? I might have tunnel vision, if you have a more suitable way of doing this, please share.
Edit: I've read http://beej.us/guide/bgnet/output/html/singlepage/bgnet.html#sendall, and thought that could perhaps be another way to do it. But then I would have to statically allocate a char array (char receive_data[512];), which might work as well, but is it as flexible as the way I'm trying to do it?
Edit: This first part of the answer deals with the suitability of alternately sending the message size and the message proper as a communication protocol.
It will work, unless you can get out of sync. (I.e. could the receiver possibly miss the size packet but receive the data packet?)
You should also consider security implications: Can an attacker cause a denial of service by sending you an absurdly large data size, causing your server process to either swap massively or die.
Usually your problem is addressed by logically splitting network packets into a header and a body part, with a field in the header specifying the packet length, and the maximum length limited to a reasonable value. This is very similar to what you are doing.
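A minimal sketch of that defence, assuming a 4-byte network-byte-order length header; MAX_MSG_SIZE, read_exactly and read_message are illustrative names, not from your code:
#include <arpa/inet.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_MSG_SIZE (16u * 1024 * 1024)   /* refuse anything over 16 MB */

/* Loop until exactly n bytes are read (handles short reads). */
static int read_exactly(int sock, void *buf, size_t n)
{
    char *p = buf;
    while (n > 0) {
        ssize_t r = read(sock, p, n);
        if (r <= 0)
            return -1;               /* error, or peer closed early */
        p += r;
        n -= (size_t)r;
    }
    return 0;
}

char *read_message(int sock, uint32_t *out_len)
{
    uint32_t len;
    if (read_exactly(sock, &len, sizeof(len)) != 0)
        return NULL;
    len = ntohl(len);                /* header sent in network byte order */
    if (len == 0 || len > MAX_MSG_SIZE)
        return NULL;                 /* reject absurd attacker-supplied sizes */
    char *buf = malloc(len);
    if (buf == NULL)
        return NULL;
    if (read_exactly(sock, buf, len) != 0) {
        free(buf);
        return NULL;
    }
    *out_len = len;
    return buf;
}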
Edit: This second part of the answer deals with short reads/writes
After reading your latest comment, I guess your problem is:
Your communication is over TCP. This is your first time you do something like this. You've read about short reads and short writes, i.e. read() reading fewer bytes than you told it to, or write writing fewer bytes than you had available. You've read about how people usually deal with this (bookkeeping), but you wonder if there might be an easier solution: What if you just allocated enough memory, so that there would be no need for the read and write system calls to return without reading/writing everything you told them to.
Is that about correct?
Then my response is: No, this is not a valid solution for the short read / short write problem. The limiting factor is not the memory that you provide to these system calls. All of your reads and writes may still end up shorter than you told them to. There is no easy way out of the bookkeeping.

how to design a server for variable size messages

I want some feedback or suggestions on how to design a server that handles variable-size messages.
To simplify the answer, let's assume:
single-threaded, epoll() based
the protocol is: data-size + data
data is stored on a ringbuffer
the read code, with some simplification for clarity, looks like this:
if (client->readable) {
    if (client->remaining > 0) {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
        rd = read(client->sock, client->buffer, client->remaining);
        client->buffer += rd;
        client->remaining -= rd;
    } else {
        /* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
        read(client->sock, &(client->remaining), 4);
        client->buffer = acquire_ringbuf_slot(client->remaining);
    }
}
Please don't focus on the 4-byte header; just assume the data size comes at the beginning. Whether it's compressed or not makes no difference for this discussion.
Now, the questions are: what is the best way to do the above?
Assume both small data (a few bytes) and large data (many MB).
How can we reduce the number of read() calls? E.g. if we have 4 messages of 16 bytes each on the stream, it seems wasteful to make 8 calls to read().
Are there better alternatives to this design?
PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code I assume you understand that TCP is a stream-oriented protocol
(so when a client sends a piece of data, that data is stored in the socket send buffer, and TCP may use one or more TCP segments to convey it to the other end, the server).
So the code looks very good so far (assuming you have error checks and other things in the real code).
Now for your questions, here are my responses based on my experience (though there could be better solutions):
1- This problem has challenges similar to those of OS memory management: dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of a certain size, say 4KB.
You will retrieve as many chunks as you need to store your received message. The last one will have unused space; this is the internal-fragmentation trade-off.
The drawback comes when you need to apply certain types of processing (maybe a visitor pattern) to the message, like parsing/routing/transformation/etc.; that will be more complex and less efficient than with one huge buffer of contiguous memory. On the other hand, the drawbacks of a huge buffer are much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in the middle (think about chunks that could also be contiguous whenever available), always depending on your goals. Something useful is to implement an abstraction over the fragmented memory, so that every function (or visitor) that is applied works as if it were dealing with contiguous memory.
If you use these chunks, then once the message has been processed and dropped/forwarded/eaten/whatever, you return them to the list of free chunks.
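A minimal sketch of such a free list of fixed-size chunks (sizes and names are illustrative; single-threaded, so no locking is shown):
#include <stdlib.h>

#define CHUNK_SIZE 4096

struct chunk {
    struct chunk *next;
    char data[CHUNK_SIZE];
};

static struct chunk *free_list = NULL;

static struct chunk *chunk_get(void)
{
    if (free_list) {                 /* reuse a returned chunk if any */
        struct chunk *c = free_list;
        free_list = c->next;
        return c;
    }
    return malloc(sizeof(struct chunk));   /* may return NULL */
}

static void chunk_put(struct chunk *c)
{
    c->next = free_list;             /* push back onto the free list */
    free_list = c;
}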
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at the TCP layer is to increase the server's receive buffer, so that it can accept more data even when the server cannot read it fast enough.
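For instance, a minimal sketch (the kernel may clamp or round the value you request; grow_rcvbuf is an illustrative name):
#include <sys/socket.h>

static void grow_rcvbuf(int sock, int bytes)
{
    /* Ask the kernel for a bigger receive buffer; not guaranteed
       to be honoured exactly (Linux doubles it internally). */
    setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}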
3- The ring buffer is OK; if you use chunks, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should only read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read into the user-space buffer, it should call a callback, passing iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass them to another, higher-level callback. The parser function should return the number of bytes consumed, so that those bytes can be discarded from the user-space socket buffer.
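A minimal sketch of that read/parse split, with illustrative names and error handling omitted; the parser reports how many bytes it consumed:
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Parser callback: scans [buf, buf + len), extracts complete
   messages, and returns the number of bytes it consumed. */
typedef size_t (*parse_fn)(const char *buf, size_t len, void *ctx);

static char rdbuf[64 * 1024];   /* sized to match the kernel buffer */
static size_t used = 0;

static void on_readable(int sock, parse_fn parse, void *ctx)
{
    ssize_t n = read(sock, rdbuf + used, sizeof(rdbuf) - used);
    if (n <= 0)
        return;                 /* EOF/error handling omitted */
    used += (size_t)n;

    size_t consumed = parse(rdbuf, used, ctx);

    /* Shift the unconsumed tail to the front for the next read. */
    memmove(rdbuf, rdbuf + consumed, used - consumed);
    used -= consumed;
}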

C Windows buffer size

In Windows, let's say I'm using the recv function to receive data from a socket.
I'm curious how big an optimal buffer would be. I could make it 1024 bytes, or I could make it 51200 bytes or bigger. I'm wondering which would be better for performance.
This doesn't only apply to the recv function; let's say I'm reading a large text file: do I want a very large buffer, or a smaller buffer?
The operating system performs its own buffering, so the size of your buffer does not really matter. The performance penalty lies in the function call: a 1-byte buffer is inefficient because it requires too many calls to recv(), while a buffer that is too big is just a waste of space.
An optimal size is something like twice the size of the data you expect to receive, or are able to process, in a single recv() call, with a lower limit of approximately one or two TCP segments.
I personally use a 4KB buffer, but that's my own preference, and it depends largely on the application I am writing.
The operating system buffers it already, so you might as well just read one byte at a time.
This would depend on the kind of data you are expecting and the protocol you are using.
UDP, for example, would give you a whole packet in one go, so an optimal buffer size could be 1500.
The real performance impediment is the function call (recv in this case) that you make. To improve performance, you could start with a large default value and then analyze the packet sizes being received. Based on the analysis, you could arrive at an (almost) "ideal" buffer size.
If you have a server-type situation where you don't know what services will run on it, you could use an array of buffer pools of varying sizes, e.g. [128, 1024, 4096, 16384, 65536]. When something connects, use the 128 size and, if all 128 bytes are filled, use 1024 next time, and so on.
On clients or servers with a known protocol/loading, just sorta guess at it (much as suggested by the other posters :)
Rgds,
Martin

Buffering data from sockets?

I am trying to make a simple HTTP server that would be able to parse client requests and send responses back.
Now I have a problem. I have to read and handle one line at a time in the request, and I don't know if I should:
read one byte at a time, or
read chunks of N bytes at a time, put them in a buffer, and then handle the bytes one by one, before reading a new chunk of bytes.
What would be the best option, and why?
Also, are there some alternative solutions to this? Like a function that would read a line at a time from the socket or something?
Reading a single byte at a time is going to kill performance. Consider a circular buffer of decent size.
Read chunks of whatever size is free in the buffer. Most of the time you will get short reads. Check for the end of the HTTP command in the bytes read. Process complete commands, and the next byte becomes the head of the buffer. If the buffer becomes full, copy it off to a backup buffer, use a second circular buffer, report an error, or whatever is appropriate.
The short answer to your question is that I would go with reading a single byte at a time. Unfortunately it's one of those cases where there are pros and cons both ways.
In favour of a buffer is the fact that the implementation can be more efficient from the perspective of network IO. Against a buffer, the code will be inherently more complex than the single-byte version. So it's an efficiency vs. complexity trade-off. The good news is that you can implement the simple solution first, profile the result, and "upgrade" to a buffered approach if testing shows it to be worthwhile.
Also, just to note, as a thought experiment I wrote some pseudo code for a loop that does buffer-based reads of HTTP packets, included below. The complexity of implementing a buffered read doesn't seem too bad. Note, however, that I haven't given much consideration to error handling, or tested whether this will work at all. It should, though, avoid excessive "double handling" of data, which is important since that would reduce the efficiency gains that were the purpose of this approach.
#define CHUNK_SIZE 1024

size_t nextHttpBytesRead = 0;
char *nextHttp = NULL;
while (1)
{
    size_t httpBytesRead = nextHttpBytesRead;
    size_t thisHttpSize;
    char *http = nextHttp;
    char *temp;
    char *httpTerminator;
    do
    {
        temp = realloc(http, httpBytesRead + CHUNK_SIZE + 1);
        if (NULL == temp)
            ...
        http = temp;
        httpBytesRead += read(httpSocket, http + httpBytesRead, CHUNK_SIZE);
        http[httpBytesRead] = '\0';   // Terminate so strstr() can't run off the end
        httpTerminator = strstr(http, "\r\n\r\n");
    } while (NULL == httpTerminator);
    thisHttpSize = (size_t)(httpTerminator - http) + 4;   // Include terminator
    nextHttpBytesRead = httpBytesRead - thisHttpSize;
    // Adding CHUNK_SIZE here means that the first realloc won't have to do any work
    nextHttp = malloc(nextHttpBytesRead + CHUNK_SIZE);
    memcpy(nextHttp, http + thisHttpSize, nextHttpBytesRead);
    http[thisHttpSize] = '\0';
    processHttp(http);
    free(http);   // nextHttp now owns the leftover bytes
}
The TCP data stream arrives one IP packet at a time, which can be up to roughly 1,500 bytes depending on the IP layer configuration. In Linux this data waits in one SKB until the application layer reads it off the queue. If you read one byte at a time you suffer the overhead of context switches between the application and kernel simply to copy one byte from one structure to another. The optimum solution is to use non-blocking IO to read the content of one SKB at a time, thereby minimizing the switches.
If you are after optimum bandwidth you could read off a larger number of bytes to further reduce the context switches, at the expense of latency, since more time will be spent out of the application copying memory. But this only applies to extremes, and such code changes should be implemented only when required.
If you examine the multitude of existing HTTP technologies you can find alternative approaches, such as using multiple threads and blocking sockets, pushing more work back into the kernel to reduce the overhead of switching into the application and back.
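A minimal sketch of that idea, assuming the socket has already been put in non-blocking mode with O_NONBLOCK (names are illustrative; error handling abbreviated):
#include <errno.h>
#include <unistd.h>

static void drain(int sock, void (*consume)(const char *, size_t))
{
    char buf[16 * 1024];
    for (;;) {
        ssize_t n = read(sock, buf, sizeof(buf));
        if (n > 0) {
            consume(buf, (size_t)n);      /* hand data to the application */
            continue;
        }
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;                        /* kernel queue drained for now */
        break;                            /* EOF or a real error */
    }
}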
I have implemented a HTTP server library very similar to torak's pseudo code here, http://code.google.com/p/openpgm/source/browse/trunk/openpgm/pgm/http.c The biggest speed improvements for this implementation came from making everything asynchronous so nothing ever blocks.
Indy, for example, takes the buffered approach. When the code asks Indy to read a line, it first checks its current buffer to see if a line break is present. If not, the network is read in chunks and appended to the buffer until the line break appears. Once it does, just the data up to the line break is removed from the buffer and returned to the app, leaving any remaining data in the buffer for the next reading operation. This makes for a good balance between a simple application-level API (ReadLine, ReadStream, etc), while providing for efficient network I/O (read everything that is currently available in the socket, buffer it, and ack it so the sender is not waiting too long - fewer network-level reads are needed this way).
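Indy itself is Delphi, but the same pattern in C might look like this minimal sketch (names are illustrative, and the terminator search is simplified to '\n'):
#include <string.h>
#include <unistd.h>

static char linebuf[8192];
static size_t buffered = 0;

/* Return the next line with '\n' stripped (a trailing '\r' from CRLF
   is left for the caller to trim). Returns -1 on error/EOF or if a
   line doesn't fit in the buffer. */
static ssize_t read_line(int sock, char *out, size_t outsz)
{
    for (;;) {
        /* Is a complete line already buffered? */
        char *eol = memchr(linebuf, '\n', buffered);
        if (eol != NULL) {
            size_t len = (size_t)(eol - linebuf);
            size_t copy = len < outsz ? len : outsz - 1;
            memcpy(out, linebuf, copy);
            out[copy] = '\0';
            /* Keep everything after the line break for the next call. */
            memmove(linebuf, eol + 1, buffered - len - 1);
            buffered -= len + 1;
            return (ssize_t)copy;
        }
        if (buffered == sizeof(linebuf))
            return -1;                    /* line too long for the buffer */
        ssize_t n = read(sock, linebuf + buffered, sizeof(linebuf) - buffered);
        if (n <= 0)
            return -1;                    /* error or connection closed */
        buffered += (size_t)n;
    }
}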
Start by reading one byte at a time (though noting that lines end with CR/LF in HTTP) because it's simple. If that's not sufficient, do more complex things.
Read a byte array buffer at a time. Reading single characters will be dog slow because of the multiple context switches between user and kernel mode (depending on the libc actually).
If you read buffers, you need to be prepared for the buffer to be filled only partially (watch the returned length), for it to not contain enough bytes to reach the line end, or for it to contain more than one line.
Mapping line-based or fixed-size block requests onto that variable stream of buffers is a common pattern in network applications (and often implemented wrong; for example, a 0-byte-length answer is possible). Higher-level languages will hide this complexity from you.

Is it better to send 1 large chunk or lots of small ones when using TCP?

After I accept() a connection, and then write() to the client socket, is it better to write all the data you intend to send at once or send it in chunks?
For example:
accept, write 1MB, disconnect
…or…
accept, write 256 bytes, write 256 bytes, … n, disconnect
My gut feeling tells me that the underlying protocol does this automatically, with error correction, etc. Is this correct, or should I be chunking my data?
Before you ask, no I'm not sure where I got the idea to chunk the data – I think it's an instinct I've picked up from programming C# web services (to get around receive buffer limits, etc, I think). Bad habit?
Note: I'm using C
The client and server will break up your data as they see fit, so you can send as much as you like in one chunk. Check the article A User's Guide to TCP Windows by Von Welch.
Years and years ago, I had an application that sent binary data: it did one send with the size of the following buffer, then another send with the buffer (a few hundred bytes). After profiling, we discovered that we could get a major speed-up by combining them into one buffer and sending it just once. We were surprised; even though there is some network overhead on each packet, we didn't think it would be a noticeable factor.
From a TCP level, yes your big buffer will be split up when it is too large, and it will be combined when it is too small.
From an application level, don't let your application deal with unbounded buffer sizes. At some level you need to split them up.
If you are sending a file over a socket, and perhaps processing some of its data along the way (compressing it, say), then you need to split it into chunks. Otherwise you will use too much RAM when you eventually encounter a large file, and your program will run out of memory.
RAM is not the only problem. If your buffer gets too big, you may spend too much time reading in the data, or processing it, and you won't be using the socket that is sitting there waiting for data. For this reason it's best to have a parameter for the buffer size so that you can determine a value that is not too small, nor too big.
My claim is not that a TCP socket can't handle a big chunk of data; it can, and I suggest using bigger buffers when sending for better efficiency. My claim is simply: don't deal with unbounded buffer sizes in your application.
The Nagle Algorithm, which is usually enabled by default on TCP sockets, will likely combine those four 256 byte writes into the same packet. So it really doesn't matter if you send it as one write or several, it should end up in one packet anyways. Sending it as one chunk makes more sense if you have a big chunk to begin with.
If you're computing the data between those writes, it may be better to stream them as they're available. Also, writing them all at once may produce buffer overruns (though that's probably rare, it does happen), meaning that your app needs to pause and re-try the writes (not all of them, just from the point where you hit the overflow.)
I wouldn't usually go out of my way to chunk the writes, especially not as small as 256 byte chunks. (Since roughly 1500 bytes can fit in an Ethernet packet after TCP/IP overhead, I'd use chunks at least that large.)
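Conversely, if Nagle's coalescing ever adds unwanted latency for genuinely small, time-sensitive writes, it can be switched off per socket; a minimal sketch (disable_nagle is an illustrative name):
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void disable_nagle(int sock)
{
    /* TCP_NODELAY sends small writes immediately instead of
       letting the stack coalesce them into fuller packets. */
    int one = 1;
    setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}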
I would send it all in one big chunk, since the underlying layers of the OSI model split it up anyway. You therefore don't have to worry about how big the chunks you send are, as the layers will split them up as necessary.
The only absolute answer is to profile your app in each case. There are so many factors that it is not possible to give an exact answer that is correct in all cases.
