Buffering data from sockets? - c

I am trying to make a simple HTTP server that would be able to parse client requests and send responses back.
Now I have a problem. I have to read and handle one line at a time in the request, and I don't know if I should:
read one byte at a time, or
read chunks of N bytes at a time, put them in a buffer, and then handle the bytes one by one, before reading a new chunk of bytes.
What would be the best option, and why?
Also, are there some alternative solutions to this? Like a function that would read a line at a time from the socket or something?

Single byte at a time is going to kill performance. Consider a circular buffer of decent size.
Read chunks of whatever size is free in the buffer. Most of the time you will get short reads. Check for the end of the http command in the read bytes. Process complete commands and next byte becomes head of buffer. If buffer becomes full, copy it off to a backup buffer, use a second circular buffer, report an error or whatever is appropriate.

The short answer to your question is that I would go with reading a single byte at a time. Unfortunately its one of those cases where there are pros and cons for both cases.
For the use of a buffer is the fact that the implementation can be more efficient from the perspective of the network IO. Against the use of a buffer, I think that the code will be inherently more complex than the single byte version. So its an efficiency vs complexity trade off. The good news is that you can implement the simple solution first, profile the result and "upgrage" to a buffered approach if testing shows it to be worthwhile.
Also, just to note, as a thought experiment I wrote some pseudo code for a loop that does buffer based reads of http packets, included below. The complexity to implement a buffered read doesn't seem to bad. Note however that I haven't given much consideration to error handling, or tested if this will work at all. However, it should avoid excessive "double handling" of data, which is important since that would reduce the efficiency gains which were the purpose of this approach.
#define CHUNK_SIZE 1024
nextHttpBytesRead = 0;
nextHttp = NULL;
while (1)
{
size_t httpBytesRead = nextHttpBytesRead;
size_t thisHttpSize;
char *http = nextHttp;
char *temp;
char *httpTerminator;
do
{
temp = realloc(httpBytesRead + CHUNK_SIZE);
if (NULL == temp)
...
http = temp;
httpBytesRead += read(httpSocket, http + httpBytesRead, CHUNK_SIZE);
httpTerminator = strstr(http, "\r\n\r\n");
}while (NULL == httpTerminator)
thisHttpSize = ((int)httpTerminator - (int)http + 4; // Include terminator
nextHttpBytesRead = httpBytesRead - thisHttpSize;
// Adding CHUNK_SIZE here means that the first realloc won't have to do any work
nextHttp = malloc(nextHttpBytesRead + CHUNK_SIZE);
memcpy(nextHttp, http + thisHttpSize, nextHttpSize);
http[thisHttpSize] = '\0';
processHttp(http);
}

TCP data stream is coming in at one IP packet at a time, which can be up to 1,500 or so bytes depending on the IP layer configuration. In Linux this is going to be wait in one SKB waiting for the application layer to read off the queue. If you read one byte at a time you suffer the overhead of context switches between the application and kernel for simply copying one byte from one structure to another. The optimum solution is to use non-blocking IO to read the content of one SKB at a time and so minimize the switches.
If you are after optimum bandwidth you could read off longer size of bytes in order to further reduce the context switches at the expensive of latency as more time will be spent out of the application copying memory. But this only applies to extremes and such code changes should be implemented when required.
If you examine the multitude of existing HTTP technologies you can find alternative approaches such as using multiple threads and blocking sockets, pushing more work back into the kernel to reduce the overhead of switching into the application and back.
I have implemented a HTTP server library very similar to torak's pseudo code here, http://code.google.com/p/openpgm/source/browse/trunk/openpgm/pgm/http.c The biggest speed improvements for this implementation came from making everything asynchronous so nothing ever blocks.

Indy, for example, takes the buffered approach. When the code asks Indy to read a line, it first checks its current buffer to see if a line break is present. If not, the network is read in chunks and appended to the buffer until the line break appears. Once it does, just the data up to the line break is removed from the buffer and returned to the app, leaving any remaining data in the buffer for the next reading operation. This makes for a good balance between a simple application-level API (ReadLine, ReadStream, etc), while providing for efficient network I/O (read everything that is currently available in the socket, buffer it, and ack it so the sender is not waiting too long - fewer network-level reads are needed this way).

Start by reading one byte at a time (though noting that lines end with cr/lf in HTTP) because it's simple. If that's not sufficient, do more complex things.

Read a byte array buffer at a time. Reading single characters will be dog slow because of the multiple context switches between user and kernel mode (depending on the libc actually).
If you read buffers, you need to be prepared that the buffer could eighter be not filled completely (watch the length return), that the buffer does not contain enough bytes to line end or that the buffer contains more than one line.
It is a common pattern in network applications how to map your line or fixed-size block requests to that variable steam of buffers (and often implemented wrong, for example a 0 byte length answer is possible). Higher languages will hide you from this complexity.

Related

Why should I use circular buffers when reading and writing to sockets in C?

I'm doing an assignment where the goal is to create a basic FTP server in C capable of handling multiple clients at once.
The subject tells us to "wisely use circular buffers" but I don't really understand why or how ?
I'm already using select to know when I can read or write into my socket without blocking as I'm not allowed to use recv, send or O_NONBLOCKING.
Each connection has a structure where I store everything related to this client like the communication file descriptor, the network informations and the buffers.
Why can't I just use read on my socket into a fixed size buffer and then pass this buffer to the parsing function ?
Same goes for writing : why can't I just dprintf my response into the socket ?
From my point of view using a circular buffer adds a useless layer of complexity just to be translated back into a string to parse the command or to send back the response.
Did I misunderstood the subject ? Instead of storing individual characters should I store commands and responses as circular buffers of strings ?
Why should I use circular buffers when reading and writing to sockets in C?
The socket interface does not itself provide a reason for using circular buffers (a.k.a. ring buffers). You should be looking instead at the protocol requirements of the application using the socket -- the FTP protocol in this case. This will be colored by the characteristics of the underlying network protocol (TCP for FTP) and their effect on the behavior of the socket layer.
Why can't I just use read on my socket into a fixed size buffer and then pass this buffer to the parsing function ?
You surely could do without circular buffers, but that wouldn't be as simple as you seem to suppose. And that's not the question you should be asking anyway: it's not whether circular buffers are required, but what benefit they can provide that you might not otherwise get. More on that later.
Also, you surely can have fixed size circular buffers -- "circular" and "fixed size" are orthogonal characteristics. However, it is usually among the objectives of using a circular buffer to minimize or eliminate any need for dynamically adjusting the buffer size.
Same goes for writing : why can't I just dprintf my response into the socket ?
Again, you probably could do as you describe. The question is what do you stand to gain from interposing a circular buffer? Again, more later.
From my point of view using a circular buffer adds a useless layer of
complexity just to be translated back into a string to parse the
command or to send back the response.
Did I misunderstood the subject ?
That you are talking about translating to and from strings makes me think that you did indeed misunderstand the subject.
Instead of storing individual
characters should I store commands and responses as circular buffers
of strings ?
Again, where do you think "of strings" comes into it? Why are you supposing that the elements of the buffer(s) would represent (whole) messages?
A circular buffer is more a manner of use of an ordinary, flat, usually fixed-size buffer than it is a separate data structure of its own. There is a little bit of extra bookkeeping data involved, however, so I won't quibble with anyone who wants to call it a data structure in its own right.
Circular buffers for input
Among the main contexts for circular buffers' usefulness is data arriving with stream semantics (such as TCP provides) rather than with message semantics (such as UDP provides). With respect to your assignment, consider this: when the server reads command input, how does it know where the command ends? I suspect you're supposing that you will get one complete command per read(), but that is in no way a safe assumption, regardless of the implementation of the client. You may get partial commands, multiple commands, or both on each read(), and you need to be prepared to deal with that.
So suppose, for example, that you receive one and a half control messages in one read(). You can parse and respond to the first, but you need to read more data before you can act on the second. Where do you put that data? Ok, you read it into the end of the buffer. And what if on the next read() you get not only the rest of a message, but also part of another message?
You cannot keep on indefinitely adding data at the end of the buffer, not even if you dynamically allocate more space as needed. You could at some point move the unprocessed data from the tail of the buffer to the beginning, thus opening up space at the end, but that is costly, and at this point we are well past the simplicity you had in mind. (That simplicity was always imaginary.) Alternatively, you can perform your reads into a circular buffer, so that consuming data from the (logical) beginning of the buffer automatically makes space available at the (logical) end.
Circular buffers for output
Similar applies on the writing side with a stream-oriented network protocol. Consider that you cannot write() an arbitrary amount of data at a time, and it is very hard to know in advance exactly how much you can write. That's more likely to bite you on the data connection than on the control connection, but in principle, it applies to both. If you have only one client to feed at a time then you can keep write()ing in a loop until you've successfully transferred all the data, and this is what dprintf() would do. But that's potentially a blocking operation, so it undercuts your responsiveness when you are serving multiple clients at the same time, and maybe even with just one if (as with FTP) there are multiple connections per client.
You need to buffer data on the server, especially for the data connection, and now you have pretty much the same problem that you did on the reading side: when you've written only part of the data you want to send, and the socket is not ready for you to send more, what do you do? You could just track where you are in the buffer, and send more pieces as you can until the buffer is empty. But then you are wasting opportunities to read more data from the source file, or to buffer more control responses, until you work through the buffer. Once again, a circular buffer can mitigate that, by giving you a place to buffer more data without requiring it to start at the beginning of the buffer or being limited by the available space before the physical end of the buffer.

Is there a portable way to discard a number of readable bytes from a socket-like file descriptor?

Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:
A stream of records is arriving on a file descriptor (which can be a TCP, a SOCK_STREAM type UNIX domain socket or potentially a pipe). Each record is preceeded by a fixed size header specifying its type and length, followed by data of variable length. I want to read the header first and if it's not of the type I'm interested in, I want to just discard the following data segment without transferring them into user space into a dummy buffer.
A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to asynchronous nature, the records may still be incomplete when the fd becomes readable, or they may be complete but a piece of the next record already may be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between the records so I don't need to manage partially loaded records I accidentally read from the fd. So, I use recv() with MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing data from the socket) to the exact length. This would copy the data twice - I want to avoid that by simply discarding the data buffered in the socket by an exact amount.
On Linux, I gather it is possible to achieve that by using splice() and redirecting the data to /dev/null without copying them to userspace. However, splice() is Linux-only, and the similar sendfile() that is supported on more platforms can't use a socket as input. My questions are:
Is there a portable way to achieve this? Something that would work on other UNIXes (primarily Solaris) as well that do not have splice()?
Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?
Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count of readable bytes from a file descriptor fd in kernel (i.e. without copying anything to userspace), blocks on blockable fd until the requested number of bytes is discarded, or returns the number of successfully discarded bytes or EAGAIN on a non-blocking fd just like read() would do. And advances the seek position on a regular file of course :)
The short answer is No, there is no portable way to do that.
The sendfile() approach is Linux-specific, because on most other OSes implementing it, the source must be a file or a shared memory object. (I haven't even checked if/in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)
Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.
However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.
You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, they would point out that the performance increase this hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things, is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)
In your first case, a process receiving structured data framed by a type and length identifier, wishing to skip unneeded frames, is better fixed by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., basic filtering approach). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require others to work extra too.)
In your second case, you already read your data. Don't do that. Instead, use an userspace buffer large enough to hold two full size frames. Whenever you need more data, but the start of the frame is already past the midway of the buffer, memmove() the frame to start at the beginning of the buffer first.
When you have a partially read frame, and you have N unread bytes from that left that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.
If the frames are small, say 65536 bytes maximum, you should use a tunable for the maximum buffer size. On most desktop and server machines, with high-bandwidth stream sockets, something like 2 MiB (2097152 bytes or more) is much more reasonable. It's not too much memory wasted, but you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full cachelines are copied, aligned, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers for converting decimal integers or floating-point values are used for better performance), and this approach does work well in practice. Simply put, skipping data already in your buffer is not something you need to optimize; it is insignificant overhead compared to simply avoiding doing the things you do not need.
There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall clock used in the overall task. They are completely different things.
For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset to memory, construct an array of pointers to each line, sort the pointers, and finally write each line (using either internal buffering and/or POSIX writev() so that you do not need to do a write() syscall for each separate line).
However, if you wish to minimize the wall clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert-in-order each line completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting them using any robust sorting algorithm! The sorting algorithms that work inline (as data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that is otherwise wasted waiting for the entire dataset to load into memory, it is completed in less wall clock time!
If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme, and the sort scheme) on their own. (I do consider using resources like Linux man pages online and Wikipedia articles and pseudocode on for example binary heaps doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)

how to design a server for variable size messages

I want some feedback or suggestion on how to design a server that handles variable size messages.
to simplify the answer lets assume:
single thread epoll() based
the protocol is: data-size + data
data is stored on a ringbuffer
the read code, with some simplification for clarity, looks like this:
if (client->readable) {
if (client->remaining > 0) {
/* SIMPLIFIED FOR CLARITY - assume we are always able to read 1+ bytes */
rd = read(client->sock, client->buffer, client->remaining);
client->buffer += rd;
client->remaining -= rd;
} else {
/* SIMPLIFIED FOR CLARITY - assume we are always able to read 4 bytes */
read(client->sock, &(client->remaining), 4);
client->buffer = acquire_ringbuf_slot(client->remaining);
}
}
please, do not focus on the 4 byte. just assume we have the data size in the beginning compressed or not does not make difference for this discussion.
now, the question is: what is the best way to do the above?
assume both small "data", few bytes and large data MBs
how can we reduce the number of read() calls? e.g. in case we have 4 message of 16 bytes on the stream, it seems a waste doing 8 calls to read().
are there better alternatives to this design?
PART of the solution depends on the transport layer protocol you use.
I assume you are using TCP which provides connection oriented and reliable communication.
From your code I assume you understand TCP is a stream-oriented protocol
(So when a client sends a piece of data, that data is stored in the socket send buffer and TCP may use one or more TCP segments to convey it to the other end (server)).
So the code, looks very good so far (considering you have error checks and other things in the real code).
Now for your questions, I give you my responses, what I think is best based on my experience (but there could be better solutions):
1-This is a solution with challenges similar to how an OS manages memory, dealing with fragmentation.
For handling different message sizes, you have to understand there are always trade-offs depending on your performance goals.
One solution to improve memory utilization and parallelization is to have a list of free buffer chunks of certain size, say 4KB.
You will retrieve as many as you need for storing your received message. In the last one you will have unused data. You play with internal fragmentation.
The drawback could be when you need to apply certain type of processing (maybe a visitor pattern) on the message, like parsing/routing/transformation/etc. It will be more complex and less efficient than a case of a huge buffer of contiguous memory. On the other side, the drawback of a huge buffer is much less efficient memory utilization, memory bottlenecks, and less parallelization.
You can implement something smarter in the middle (think about chunks that could also be contiguous whenever available). Always depending on your goals. Something useful is to implement an abstraction over the fragmented memory so that every function (or visitor) that is applied works as it were dealing with contiguous memory.
If you use these chunks, when the message was processed and dropped/forwarded/eaten/whatever, you return the unused chunks to the list of free chunks.
2-The number of read calls will depend on how fast TCP conveys the data from client to server. Remember this is stream oriented and you don't have much control over it. Of course, I'm assuming you try to read the max possible (remaining) data in each read.
If you use the chunks I mentioned above the max data to read will also depend on the chunk size.
Something you can do at TCP layer is to increase the server receive buffer. Thus, it can receive more data even when server cannot read it fast enough.
3-The ring buffer is OK, if you used chunked, the ring buffer should provide the abstraction. But I don't know why you need a ring buffer.
I like ring buffers because there is a way of implementing producer-consumer synchronization without locking (Linux Kernel uses this for moving packets from L2 to IP layer) but I don't know if that's your goal.
To pass messages to other components and/or upper-layers you could also use ring buffers of pointers to messages.
A better design may be as follows:
Set up your user-space socket read buffer to be the same size as the kernel socket buffer. If your user-space socket read buffer is smaller, then you would need more than one read syscall to read the kernel buffer. If your user-space buffer is bigger, then the extra space is wasted.
Your read function should only read as much data as possible in one read syscall. This function must not know anything about the protocol. This way you do not need to re-implement this function for different wire formats.
When your read function has read into the user-space buffer it should call a callback passing the iterators to the data available in the buffer. That callback is a parser function that should extract all available complete messages and pass these messages to another higher-level callback. Upon return the parser function should return the number of bytes consumed, so that these bytes can be discarded from the user-space socket buffer.

writing framed data without extra write() cost

So I'm sending data on a TCP socket, prefixed with the size of data, as so:
write(socket, &length, sizeof(length));
write(socket, data, length);
(Note: I have wrapper writen functions as described in the Unix Network Programming book, and am checking for errors, etc. The above is just for the simplicity of this question).
Now, my experience is that breaking up data into multiple writes can cause significant slowdown. I have had success speeding things up by creating my own buffer, then sending out one big chunk.
However, in the above case data may be incredibly large (lets say 1 Gig). I don't want to create a buffer 1 Gig large + 4 bytes, just to be able to have one write() call. Is there any way of doing something akin to:
write(socket, &length, data, sizeof(length) + length)
without paying the price of a large memory allocation ahead of time? I suppose I could just pre-allocate a chunk the size of write's buffer, and continuously send that (the below code has errors, namely, should be sending &chunk + 4 in some instances, but this is just the idea):
length += 4;
char chunk[buffer_size];
var total = 0;
while (total < length)
{
if (total < 4)
{
memcpy(&chunk, &length, 4);
total += 4;
}
memcpy(&chunk, data + total, min(buffer_size, length - total));
write(sock, &chunk, min(buffer_size, length - total));
total += min(buffer_size, length - total);
}
But in that case I don't know what write's buffer size actually is (is there an API to get it?) I also don't know if this is an appropriate solution.
There is an option to do this already. It will inform your network layer that you are going to send more data and you want to buffer rather than send it as soon as possible.
setsockopt(sock_descriptor, IPPROTO_TCP, TCP_CORK, (char *)&val, sizeof(val));
val is an int, and should be 0 or 1, with the "cork" on, your network layer will buffer things as much as possible, to only send full packets, you might want to "pop the cork" and "cork" again to handle the next batch of transmissions that you need to make on the socket.
Your idea is correct, this just saves you the trouble of implementing it, since it's already done in the network stack.
I suggest having a look at writev() (see man writev for full details).
This allows you to send multiple buffers in one go, with just one call. As a simple example, to send out two chunks in one go (one for length, one for data):
struct iovec bits[2];
/* First chunk is the length */
bits[0].iov_base = &length;
bits[0].iov_len = sizeof(length);
/* Second chunk is the payload */
bits[1].iov_base = data;
bits[1].iov_base = length;
/* Send two chunks at once */
writev(socket, bits, 2);
It can get more complicated if you need to use a variable number of chunks (you may need to allocate the array of struct iov dynamically), but the advantage is that, if your chunks are large, you can avoid copying them, and just manipulate pointer/length pairs, which are much smaller.
I think you are on the right track with your spooled solution presented. I think buffer_size should be larger than that used internally by the network stack. This way, you minimize the amount of per-write overhead without having to allocate a giant buffer. In other words, by giving the underlying network subsystem more data than it can handle at once, it is free to run at its fastest speed, spending most of its time moving data, rather than waiting for more data to be provided.
The optimal buffer_size value might vary from system to system. I would start with 1MB and do some experiments up and down from there to see what works best. There might also be values you can extract and adjust with a sysctl call for the current internal buffer size used on your system. Read this for a suggested technique. You might also use something like getsockopt(..., SO_MAX_MSG_SIZE, ...), as explained here.
Ethernet packets can range up to about 64K in size, so perhaps anything larger than 64K is sufficient. Read about maximum transmission unit (MTU) sizes to get a sense of what the lowest layers of the network stack are doing, and don't forget that the MTU varies with the network interface, not the process or kernel.
Beware that the MTU can vary along the route from your server to the data's destination. You can use ifconfig or traceroute/tracepath to discover it. With networking, every link in the chain is weak. ;)

types of buffers

Recently an interviewer asked me about the types of buffers. What types of buffers are there ? Actually this question came up when I said I will be writing all the system calls to a log file to monitor the system. He said it will be slow to write each and every call to a file. How to prevent it. I said I will use a buffer and he asked me what type of buffer ? Can some one explain me types of buffers please.
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it was a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered that to a single fprint-type call when it was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprint-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
^ ^
| |
Buffer next --+ +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
create buffer of size 10240 bytes.
set bufferEnd to end of buffer + 1
set bufferPointer to start of buffer
return
addToBuffer (size, message):
while size != 0:
xfersz = minimum (size, bufferEnd - bufferPointer)
copy xfersz bytes from message to bufferPointer
message = message + xfersz
bufferPointer = bufferPointer + xfersz
size = size - xfersz
if bufferPointer == bufferEnd:
write buffer to underlying media
set bufferPointer to start of buffer
endif
endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based buffers and size-based. So you could either just write whatever is in the buffer to file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4J do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
A particular case of neither FIFO nor LIFO buffer is the TCP segment reassembly buffers. These may hold segments received out-of-order ("from the future") which are held pending the receipt of intermediate segments not-yet-arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.

Resources