I have 2 buffers of size N. I want to write to the buffers from different threads without using locks.
I maintain a buffer index (0 and 1) and an offset where the next write to the buffer starts. If I can get the current offset and set the offset to offset + len_of_the_msg in an atomic manner, that will guarantee that the different threads do not overwrite each other. I also have to take care of buffer overflow: once a buffer is full, switch buffers and set the offset to 0.
Tasks to do, in order:
set a = offset
increment offset by msg_len
if offset > N: switch buffer, set a to 0, set offset to msg_len
I am implementing this in C. Compiler is gcc.
How can I do these operations atomically without using locks? Is it even possible?
EDIT:
I don't have to use 2 buffers. What I want to do is "Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached"
re: your edit:
I don't have to use 2 buffers. What I want to do is: Collect log message from different threads into a buffer and send the buffer to a server once some buffer usage threshold is reached
A lock-free circular buffer could maybe work, with the reader collecting all data up to the last written entry. Extending an existing MPSC or MPMC queue based on using an array as a circular buffer is probably possible; see below for hints.
Verifying that all entries have been fully written is still a problem, though, as are variable-width entries. Doing that in-band with a length + sequence number would mean you couldn't just send the byte-range to the server, and the reader would have to walk through the "linked list" (of length "pointers") to check the sequence numbers, which is slow when they inevitably cache miss. (And can possibly false-positive if stale binary data from a previous time through the buffer happens to look like the right sequence number, because variable-length messages might not line up the same way.)
Perhaps a secondary array of fixed-width start/end-position pairs could be used to track "done" status by sequence number. (Writers store a sequence number with a release-store after writing the message data. Readers seeing the right sequence number know that data was written this time through the circular buffer, not last time. Sequence numbers provide ABA protection vs. a "done" flag that the reader would have to unset as it reads. The reader can indicate its read position with an atomic integer.)
I'm just brainstorming at the moment; I might get back to this and write up more details or code, but I probably won't. If anyone else wants to build on this idea and write up an answer, feel free.
It might still be more efficient to do some kind of non-lock-free synchronization that makes sure all writers have passed a certain point. Or if each writer stores the position it has claimed, the reader can scan that array (if there are only a few writer threads) and find the lowest not-fully-written position.
I'm picturing that a writer should wake the reader (or even perform the task itself) after detecting that its increment has pushed the used space of the queue up past some threshold. Make the threshold a little higher than you normally want to actually send with, to account for partially-written entries from previous writers not actually letting you read this far.
If you are set on switching buffers:
I think you probably need some kind of locking when switching buffers. (Or at least stronger synchronization to make sure all claimed space in a buffer has actually been written.)
But within one buffer, I think lockless is possible. Whether that helps a lot or a little depends on how you're using it. Bouncing cache lines around is always expensive, whether that's just the index, or whether that's also a lock plus some write-index. And also false sharing at the boundaries between two messages, if they aren't all 64-byte aligned (to cache line boundaries.)
The biggest problem is that the buffer-number can change while you're atomically updating the offset.
It might be possible with a separate offset for each buffer, and some extra synchronization when you change buffers.
Or you can pack the buffer-number and offset into a single 64-bit struct that you can attempt to CAS with atomic_compare_exchange_weak. That can let a writer thread claim that amount of space in a known buffer. You do want CAS, not fetch_add because you can't build an upper limit into fetch_add; it would race with any separate check.
So you read the current offset, check there's enough room, then try to CAS with offset+msg_len. On success, you've claimed that region of that buffer. On fail, some other thread got it first. This is basically the same as what a multi-producer queue does with a circular buffer, but we're generalizing to reserving a byte-range instead of just a single entry with CAS(&write_idx, old, old+1).
(Maybe possible to use fetch_add and abort if the final offset+len you got goes past the end of the buffer. If you can avoid doing any fetch_sub to undo it, that could be good, but it would be worse if you had multiple threads trying to undo their mistakes with more modifications. That would still leave the possible problem of a large message stopping other small messages from packing into the end of a buffer, given some orderings. CAS avoids that because only actually-usable offsets get swapped in.)
But then you also need a mechanism to know when that writer has finished storing to that claimed region of the buffer. So again, maybe extra synchronization around a buffer-change is needed for that reason, to make sure all pending writes have actually happened before we let readers touch it.
A MPMC queue using a circular buffer (e.g. Lock-free Progress Guarantees) avoids this by only having one buffer, and giving writers a place to mark each write as done with a release-store, after they claimed a slot and stored into it. Having fixed-size slots makes this much easier; variable-length messages would make that non-trivial or maybe not viable at all.
The "claim a byte-range" mechanism I'm proposing is very much what lock-free array-based queues, to, though. A writer tries to CAS a write-index, then uses that claimed space.
Obviously all of this would be done with C11 #include <stdatomic.h> for _Atomic size_t offsets[2], or with GNU C builtin __atomic_...
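As a minimal sketch of that claim-a-byte-range CAS (assuming C11 <stdatomic.h>; the struct layout, BUF_SIZE, and function name are illustrative, and whether the 8-byte struct is actually lock-free depends on the target):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define BUF_SIZE 4096          /* hypothetical buffer size */

/* Pack buffer number and offset so one CAS claims a range in a known buffer. */
typedef struct {
    uint32_t buf;              /* which buffer (0 or 1) */
    uint32_t offset;           /* next free byte in that buffer */
} bufpos;

_Atomic bufpos position;       /* 8 bytes: usually lock-free on 64-bit targets */

/* Try to claim msg_len bytes in the current buffer.  On success, *out tells the
 * caller which buffer and offset it owns.  On failure (no room), the caller has
 * to handle the buffer switch with extra synchronization, as discussed above. */
bool claim_range(uint32_t msg_len, bufpos *out)
{
    bufpos old = atomic_load(&position);
    for (;;) {
        if (old.offset + msg_len > BUF_SIZE)
            return false;                      /* doesn't fit: needs a buffer switch */
        bufpos desired = { old.buf, old.offset + msg_len };
        if (atomic_compare_exchange_weak(&position, &old, desired)) {
            *out = old;    /* we own [old.offset, old.offset + msg_len) in buffer old.buf */
            return true;
        }
        /* CAS failed: 'old' now holds the fresh value, so just retry. */
    }
}

Note that the shared offset never moves past BUF_SIZE here, which is the property the fetch_add variant above cannot guarantee.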
I believe this is not solvable in a lock-free manner, unless you're only ruling out OS-level locking primitives and can live with brief spin locks in application code (which would be a bad idea).
For discussion, let's assume your buffers are organized this way:
#define MAXBUF 100

struct mybuffer {
    char data[MAXBUF];
    int  index;
};

struct mybuffer Buffers[2];
int currentBuffer = 0; // switches between 0 and 1
Though parts can be done with atomic-level primitives, in this case the entire operation has to be done atomically, so it is really one big critical section. I cannot imagine any compiler with a unicorn primitive for this.
Looking at the GCC __atomic_add_fetch() primitive: it adds a given value (the message size) to a variable (the current buffer's write index), returning the new value; this way you could test for overflow.
Here's some rough code that is not correct:
// THIS IS ALL WRONG!
int oldIndex = Buffers[currentBuffer].index;
if (__atomic_add_fetch(&Buffers[currentBuffer].index, mysize, __ATOMIC_SEQ_CST) > MAXBUF)
{
    // overflow, must switch buffers
    // do same thing with new buffer
    // recompute oldIndex
}
// copy your message into Buffers[currentBuffer] at oldIndex
This is wrong in every way, because at almost every point some other thread could sneak in and change things out from under you, causing havoc.
What if your code grabs the oldIndex that happens to be from buffer 0, but then some other thread sneaks in and changes the current buffer before your if test even gets to run?
The __atomic_add_fetch() would then be allocating data in the new buffer but you'd copy your data to the old one.
This is the NASCAR of race conditions; I do not see how you can accomplish this without treating the whole thing as a critical section, making other threads wait their turn.
void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF); // avoid danger

    // ENTER CRITICAL SECTION
    struct mybuffer *buf = &Buffers[currentBuffer];

    // is there room in this buffer for the entire message?
    // if not, switch to the other buffer.
    //
    // QUESTION: do messages have to fit entirely into a buffer
    // (as this code assumes), or can they be split across buffers?
    if ((buf->index + n) > MAXBUF)
    {
        // QUESTION: there is unused data at the end of this buffer,
        // do we have to fill it with NUL bytes or something?
        currentBuffer = (currentBuffer + 1) % 2; // switch buffers
        buf = &Buffers[currentBuffer];
    }

    int myindex = buf->index;
    buf->index += n;

    // copy your data into the buffer at myindex;
    // LEAVE CRITICAL SECTION
}
We don't know anything about the consumer of this data, so we can't tell how it gets notified of new messages, or if you can move the data-copy outside the critical section.
But everything inside the critical section MUST be done atomically, and since you're using threads anyway, you may as well use the primitives that come with thread support. Mutexes probably.
One benefit of doing it this way, in addition to avoiding race conditions, is that the code inside the critical section doesn't have to use any of the atomic primitives and can just be ordinary (but careful) C code.
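For example, one way to realize the ENTER/LEAVE markers above, assuming POSIX threads (a sketch reusing the declarations from earlier):

#include <pthread.h>
#include <string.h>
#include <assert.h>

static pthread_mutex_t bufferLock = PTHREAD_MUTEX_INITIALIZER;

void addDataTobuffer(const char *msg, size_t n)
{
    assert(n <= MAXBUF);                          // avoid danger

    pthread_mutex_lock(&bufferLock);              // ENTER CRITICAL SECTION

    struct mybuffer *buf = &Buffers[currentBuffer];
    if ((buf->index + n) > MAXBUF)                // no room: switch buffers
    {
        currentBuffer = (currentBuffer + 1) % 2;
        buf = &Buffers[currentBuffer];
    }
    int myindex = buf->index;
    buf->index += n;
    memcpy(buf->data + myindex, msg, n);          // copy while still holding the lock

    pthread_mutex_unlock(&bufferLock);            // LEAVE CRITICAL SECTION
}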
An additional note: it's possible to roll your own critical-section code with some interlocked-exchange shenanigans, but this is a terrible idea because it's easy to get wrong, makes the code harder to understand, and forgoes the tried-and-true thread primitives designed for exactly this purpose.
Related
I have two threads which share a large array of data. One thread writes to it, and the other reads from it. Because the array cannot be in an "incompletely-updated" state when read, I mutex all array operations (reads/writes).
I also try to "play nicely with the cache"- when I read/write large amounts of data, I get one mutex, and read/write as much as required in sequence, then relinquish the mutex.
[edit: to clarify, this is the cache behavior I would like to preserve. if you read/write large swathes of memory in sequence, then the cache can pull in large lines of data from memory (slow!) only once, then operate on the cache (fast!) without again hitting memory until that cache line is exhausted.]
One thing I would like to protect against is "writing to a small part of the array in one thread, and then in another thread (after receiving the mutex) reading from that small part of the array which hasn't yet been flushed to memory (out of the first thread/core's cache), resulting in an outdated read". So the solution would be to mark the array as "volatile" (right?).
Am I correct to worry that "marking the array as volatile" will totally kill my ability to read/write large chunks in accordance with a well-behaved cache? Or will every read/write be called to/from memory?
In a perfect world, what I think I'd want is the ability to: 1. grab a mutex, 2. load data from memory (as though it were volatile), 3. read/write to array (as though it weren't volatile- should be safe to rely on own cache bc mutex), 4.(in the case of write) flush any remaining cache to memory. 5. relinquish mutex
Can I accomplish this? Are there any glaring misunderstandings here on my part?
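(For what it's worth, here is a minimal sketch of that five-step sequence, assuming POSIX threads; with a pthread mutex, the lock and unlock already provide the acquire/release ordering, so steps 2 and 4 are implicit and the array does not need to be volatile. Names and sizes are illustrative.)

#include <pthread.h>
#include <string.h>

#define ARRAY_SIZE 4096                            /* illustrative */

static double shared_array[ARRAY_SIZE];
static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;

/* Copy a contiguous chunk into the shared array under the mutex.  The lock
 * acts as step 2 and the unlock as step 4; in between, the sequential memcpy
 * stays cache-friendly. */
void write_chunk(size_t start, const double *src, size_t count)
{
    pthread_mutex_lock(&array_lock);                          /* 1. grab the mutex */
    memcpy(&shared_array[start], src, count * sizeof *src);   /* 3. read/write     */
    pthread_mutex_unlock(&array_lock);                        /* 5. relinquish it  */
}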
Is there a portable way to discard a number of incoming bytes from a socket without copying them to userspace? On a regular file, I could use lseek(), but on a socket, it's not possible. I have two scenarios where I might need it:
A stream of records is arriving on a file descriptor (which can be a TCP socket, a SOCK_STREAM type UNIX domain socket, or potentially a pipe). Each record is preceded by a fixed-size header specifying its type and length, followed by data of variable length. I want to read the header first, and if it's not of a type I'm interested in, discard the following data segment without transferring it into a dummy buffer in user space.
A stream of records of varying and unpredictable length is arriving on a file descriptor. Due to the asynchronous nature, the records may still be incomplete when the fd becomes readable, or they may be complete but a piece of the next record may already be there when I try to read a fixed number of bytes into a buffer. I want to stop reading the fd at the exact boundary between records so I don't need to manage partially loaded records I accidentally read from the fd. So, I use recv() with the MSG_PEEK flag to read into a buffer, parse the record to determine its completeness and length, and then read again properly (thus actually removing data from the socket) to the exact length. This copies the data twice - I want to avoid that by simply discarding an exact amount of the data buffered in the socket.
On Linux, I gather it is possible to achieve that by using splice() and redirecting the data to /dev/null without copying them to userspace. However, splice() is Linux-only, and the similar sendfile() that is supported on more platforms can't use a socket as input. My questions are:
Is there a portable way to achieve this? Something that would work on other UNIXes (primarily Solaris) as well that do not have splice()?
Is splice()-ing into /dev/null an efficient way to do this on Linux, or would it be a waste of effort?
Ideally, I would love to have a ssize_t discard(int fd, size_t count) that simply removes count of readable bytes from a file descriptor fd in kernel (i.e. without copying anything to userspace), blocks on blockable fd until the requested number of bytes is discarded, or returns the number of successfully discarded bytes or EAGAIN on a non-blocking fd just like read() would do. And advances the seek position on a regular file of course :)
The short answer is No, there is no portable way to do that.
The sendfile() approach is Linux-specific, because on most other OSes implementing it, the source must be a file or a shared memory object. (I haven't even checked if/in which Linux kernel versions, sendfile() from a socket descriptor to /dev/null is supported. I would be very suspicious of code that does that, to be honest.)
Looking at e.g. Linux kernel sources, and considering how little a ssize_t discard(fd, len) differs from a standard ssize_t read(fd, buf, len), it is obviously possible to add such support. One could even add it via an ioctl (say, SIOCISKIP) for easy support detection.
However, the problem is that you have designed an inefficient approach, and rather than fix the approach at the algorithmic level, you are looking for crutches that would make your approach perform better.
You see, it is very hard to show a case where the "extra copy" (from kernel buffers to userspace buffers) is an actual performance bottleneck. The number of syscalls (context switches between userspace and kernel space) sometimes is. If you sent a patch upstream implementing e.g. ioctl(socketfd, SIOCISKIP, bytes) for TCP and/or Unix domain stream sockets, they would point out that the performance increase this hopes to achieve is better obtained by not trying to obtain the data you don't need in the first place. (In other words, the way you are trying to do things, is inherently inefficient, and rather than create crutches to make that approach work better, you should just choose a better-performing approach.)
In your first case, a process receiving structured data framed by a type and length identifier, wishing to skip unneeded frames, is better fixed by fixing the transfer protocol. For example, the receiving side could inform the sending side which frames it is interested in (i.e., basic filtering approach). If you are stuck with a stupid protocol that you cannot replace for external reasons, you're on your own. (The FLOSS developer community is not, and should not be responsible for maintaining stupid decisions just because someone wails about it. Anyone is free to do so, but they'd need to do it in a manner that does not require others to work extra too.)
In your second case, you already read your data. Don't do that. Instead, use a userspace buffer large enough to hold two full-size frames. Whenever you need more data, but the start of the frame is already past the midway point of the buffer, memmove() the frame to start at the beginning of the buffer first.
When you have a partially read frame, and you have N unread bytes left from it that you are not interested in, read them into the unused portion of the buffer. There is always enough room, because you can overwrite the portion already used by the current frame, and its beginning is always within the first half of the buffer.
If the frames are small, say 65536 bytes maximum, you should use a tunable for the maximum buffer size. On most desktop and server machines, with high-bandwidth stream sockets, something like 2 MiB (2097152 bytes or more) is much more reasonable. It's not too much memory wasted, but you rarely do any memory copies (and when you do, they tend to be short). (You can even optimize the memory moves so that only full cachelines are copied, aligned, since leaving almost one cacheline of garbage at the start of the buffer is insignificant.)
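A rough sketch of that buffering scheme (frame parsing is elided; the names, the two-frame sizing, and the fill() helper are illustrative):

#include <string.h>
#include <unistd.h>

#define MAX_FRAME 65536
#define BUF_SIZE  (2 * MAX_FRAME)   /* room for two full-size frames */

static unsigned char buf[BUF_SIZE];
static size_t head = 0;             /* start of the current (possibly partial) frame */
static size_t tail = 0;             /* one past the last byte read from the socket   */

/* Ensure at least 'need' bytes of the current frame are in the buffer, sliding
 * the frame back to the front first if it has drifted past the midpoint. */
static int fill(int fd, size_t need)
{
    if (head > BUF_SIZE / 2) {                    /* keep the frame in the first half */
        memmove(buf, buf + head, tail - head);
        tail -= head;
        head = 0;
    }
    while (tail - head < need) {
        ssize_t n = read(fd, buf + tail, BUF_SIZE - tail);
        if (n <= 0)
            return -1;                            /* EOF or error */
        tail += (size_t)n;
    }
    return 0;
}

Skipping the rest of an uninteresting frame is then just head += remaining;, with no extra copy at all.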
I do HPC with large datasets (including text-form molecular data, where records are separated by newlines, and custom parsers for converting decimal integers or floating-point values are used for better performance), and this approach does work well in practice. Simply put, skipping data already in your buffer is not something you need to optimize; it is insignificant overhead compared to simply avoiding doing the things you do not need.
There is also the question of what you wish to optimize by doing that: the CPU time/resources used, or the wall clock used in the overall task. They are completely different things.
For example, if you need to sort a large number of text lines from some file, you use the least CPU time if you simply read the entire dataset to memory, construct an array of pointers to each line, sort the pointers, and finally write each line (using either internal buffering and/or POSIX writev() so that you do not need to do a write() syscall for each separate line).
However, if you wish to minimize the wall clock time used, you can use a binary heap or a balanced binary tree instead of an array of pointers, and heapify or insert-in-order each line completely read, so that when the last line is finally read, you already have the lines in their correct order. This is because the storage I/O (for all but pathological input cases, something like single-character lines) takes longer than sorting them using any robust sorting algorithm! The sorting algorithms that work inline (as data comes in) are typically not as CPU-efficient as those that work offline (on complete datasets), so this ends up using somewhat more CPU time; but because the CPU work is done at a time that is otherwise wasted waiting for the entire dataset to load into memory, it is completed in less wall clock time!
If there is need and interest, I can provide a practical example to illustrate the techniques. However, there is absolutely no magic involved, and any C programmer should be able to implement these (both the buffering scheme, and the sort scheme) on their own. (I do consider using resources like Linux man pages online and Wikipedia articles and pseudocode on for example binary heaps doing it "on your own". As long as you do not just copy-paste existing code, I consider it doing it "on your own", even if somebody or some resource helps you find the good, robust ways to do it.)
This isn't a show-stopping programming problem as such, but perhaps more of a design pattern issue. I'd have thought it'd be a common design issue on embedded resource-limited systems, but none of the questions I found so far on SO seem relevant (but please point out anything relevant that I could have missed).
Essentially, I'm trying to work out the best strategy of estimating the largest buffer size required by some writer function, when that writer function's output isn't fixed, particularly because some of the data are text strings of variable length.
This is a C application that runs on a small ARM micro. The application needs to send various message types via TCP socket. When I want to send a TCP packet, the TCP stack (Keil RL) provides me with a buffer (which the library allocates from its own pool) into which I may write the packet data payload. That buffer size depends of course on the MSS; so let's assume it's 1460 at most, but it could be smaller.
Once I have this buffer, I pass this buffer and its length to a writer function, which in turn may call various nested writer functions in order to build the complete message. The reason for this structure is because I'm actually generating a small XML document, where each writer function typically generates a specific XML element. Each writer function wants to write a number of bytes to my allocated TCP packet buffer. I only know exactly how many bytes a given writer function writes at run-time, because some of the encapsulated content depends on user-defined text strings of variable length.
Some messages need to be around (say) 2K in size, meaning they're likely to be split across at least two TCP packet send operations. Those messages will be constructed by calling a series of writer functions that produce, say, a hundred bytes at a time.
Prior to making a call to each writer function, or perhaps within the writer function itself, I initially need to compare the buffer space available with how much that writer function requires; and if there isn't enough space available, then transmit that packet and continue writing into a fresh packet later.
Possible solutions I am considering are:
Use another much larger buffer to write everything into initially. This isn't preferred because of resource constraints. Furthermore, I would still wish for a means to algorithmically work out how much space I need by my message writer functions.
At compile time, produce a 'worst case size' constant for each writer function. Each writer function typically generates an XML element such as <START_TAG>[string]</START_TAG>, so I could have something like: #define SPACE_NEEDED ( START_TAG_LENGTH + START_TAG_LENGTH + MAX_STRING_LENGTH + SOME_MARGIN ). All of my content writer functions are picked out of a table of function pointers anyway, so I could have the worst-case size estimate constant for each writer function exist as a new column in that table. At run-time, I check the buffer room against that estimate constant. This is probably my favourite solution at the moment. The only downside is that it does rely on correct maintenance to make it work.
My writer functions provide a special 'dummy run' mode where they run through and calculate how many bytes they want to write, but don't write anything. This could perhaps be achieved by simply passing NULL in place of the buffer pointer to the function, in which case the function's return value (which usually states the amount written to the buffer) just states how much it wants to write (see the sketch after this list). The only thing I don't like about this is that, between the 'dummy' and 'real' call, the underlying data could - at least in theory - change. A possible solution for that could be to statically capture the underlying data.
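Here is a sketch of that dummy-run idea; the element writer below is hypothetical and simply follows the snprintf() convention of reporting how many bytes it would write when given no buffer:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical content writer: emits <NAME>value</NAME>.  With buf == NULL it
 * writes nothing and just returns the number of bytes it wants to write. */
static size_t write_name_element(char *buf, size_t len, const char *value)
{
    int needed = snprintf(buf, buf ? len : 0, "<NAME>%s</NAME>", value);
    return (needed < 0) ? 0 : (size_t)needed;
}

/* Caller: dummy run first to check the space, then the real write. */
size_t try_append(char *pkt, size_t room, const char *value)
{
    size_t need = write_name_element(NULL, 0, value);   /* dummy run */
    if (need >= room)                                   /* keep a byte for snprintf's NUL */
        return 0;                 /* not enough space: send this packet, retry later */
    return write_name_element(pkt, room, value);        /* real write */
}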
Thanks in advance for any thoughts and comments.
Solution
Something I had actually already started doing since posting the question was to make each content writer function accept a state, or 'iteration' parameter, which allows the writer to be called many times over by the TCP send function. The writer is called until it flags that it has no more to write. If the TCP send function decides after a certain iteration that the buffer is now nearing full, it sends the packet and then the process continues later with a new packet buffer. This technique is very similar I think to Max's answer, which I've therefore accepted.
A key thing is that on each iteration, a content writer must be designed so that it won't write more than LENGTH bytes to the buffer; and after each call to the writer, the TCP send function will check that it has LENGTH room left in the packet buffer before calling the writer again. If not, it continues in a new packet.
Another step I did was to have a serious think about how I structure my message headers. It became apparent that, as I suppose with almost all protocols that use TCP, it is essential to build into the application protocol some means of indicating the total message length. The reason is that TCP is a stream-based protocol, not a packet-based protocol. This is again where it became a bit of a headache, because I needed some upfront means of knowing the total message length for insertion into the start header. The simple solution was to insert a message header at the start of every sent TCP packet, rather than only at the start of the application-protocol message (which may of course span several TCP packets), and basically implement fragmentation. So, in the header, I implemented two flags: a fragment flag and a last-fragment flag. Therefore the length field in each header only needs to state the size of the payload in that particular packet. At the receiving end, individual header+payload chunks are read out of the stream and then reassembled into a complete protocol message.
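For illustration, such a per-packet header could look something like this (field names and widths are made up, not the actual protocol):

#include <stdint.h>

/* Hypothetical header placed at the start of every sent TCP packet.
 * 'length' covers only the payload carried in this particular packet.
 * (Watch byte order and struct packing before putting it on the wire.) */
struct msg_header {
    uint16_t length;       /* payload bytes in this fragment  */
    uint8_t  flags;        /* see FLAG_* below                */
    uint8_t  msg_type;     /* application message type        */
};

#define FLAG_FRAGMENT       0x01   /* packet carries part of a larger message */
#define FLAG_LAST_FRAGMENT  0x02   /* packet carries the final part           */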
This of course is no doubt very simplistically how HTTP and so many other protocols work over TCP. It's just quite interesting that, only once I've attempted to write a robust protocol that works over TCP, have I started to realise the importance of really thinking through your message structure in terms of headers, framing, and so forth so that it works over a stream protocol.
I had a related problem in a much smaller embedded system, running on a PIC 16 micro-controller (and written in assembly language, rather than C). My 'buffer size' was always going to be the two byte UART transmit queue, and I had only one 'writer' function, which was walking a DOM and emitting its XML serialisation.
The solution I came up with was to turn the problem 'inside out'. The writer function becomes a task: each time it is called, it writes as many bytes as it can (which may be more than 2, depending on the serial data transmission rate) until the transmit buffer is full, then it returns. However, it remembers, in a state variable, how far it had got through the DOM. The next time it is called, it carries on from the point previously reached. The writer task is called repeatedly from an infinite loop, which acts as a round-robin scheduler for this task and the others in the system; if there is no free buffer space, it returns immediately without changing its state. Each time round the loop there is a delay that waits for the TMR0 timer to overflow, so each task gets called exactly once in a fixed time slice.
In my implementation, the data is transmitted by a TxEmpty interrupt routine, but it could also be sent by another task.
I guess the 'pattern' here is that one role of the program counter is to hold the current state of the flow of control, and that this role can be abstracted away from the PC to another data structure.
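In C, that 'inside out' writer might look roughly like this (the DOM walk is reduced to a position in a pre-serialised string for brevity; all names are illustrative):

#include <stdbool.h>
#include <stddef.h>

/* The state that takes over the program counter's role between calls. */
struct xml_writer {
    const char *doc;       /* serialised output still to send (stands in for the DOM walk) */
    size_t      pos;       /* how far we have got so far */
    size_t      len;
};

/* Called from the round-robin loop: copy as many bytes as the transmit buffer
 * will take right now, remember where we stopped, and report completion. */
bool xml_writer_task(struct xml_writer *w, char *txbuf, size_t txfree)
{
    size_t n = w->len - w->pos;
    if (n > txfree)
        n = txfree;                        /* only as much as fits; possibly zero */
    for (size_t i = 0; i < n; i++)
        txbuf[i] = w->doc[w->pos + i];
    w->pos += n;                           /* carry on from here next time */
    return w->pos == w->len;               /* true once everything has been emitted */
}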
Obviously, this isn't immediately applicable to your larger, higher-level system. But it is a different way of looking at the problem, which may spark your own particular insight.
Good luck!
I am trying to make a simple HTTP server that would be able to parse client requests and send responses back.
Now I have a problem. I have to read and handle one line at a time in the request, and I don't know if I should:
read one byte at a time, or
read chunks of N bytes at a time, put them in a buffer, and then handle the bytes one by one, before reading a new chunk of bytes.
What would be the best option, and why?
Also, are there some alternative solutions to this? Like a function that would read a line at a time from the socket or something?
Single byte at a time is going to kill performance. Consider a circular buffer of decent size.
Read chunks of whatever size is free in the buffer. Most of the time you will get short reads. Check for the end of the HTTP command in the bytes read. Process complete commands, and the next byte becomes the head of the buffer. If the buffer becomes full, copy it off to a backup buffer, use a second circular buffer, report an error, or whatever is appropriate.
The short answer to your question is that I would go with reading a single byte at a time. Unfortunately its one of those cases where there are pros and cons for both cases.
In favour of using a buffer is the fact that the implementation can be more efficient from the perspective of the network IO. Against using a buffer, the code will be inherently more complex than the single-byte version. So it's an efficiency vs complexity trade-off. The good news is that you can implement the simple solution first, profile the result, and "upgrade" to a buffered approach if testing shows it to be worthwhile.
Also, just to note: as a thought experiment I wrote some pseudo code for a loop that does buffer-based reads of HTTP packets, included below. The complexity of implementing a buffered read doesn't seem too bad. Note, however, that I haven't given much consideration to error handling, or tested whether this will work at all. However, it should avoid excessive "double handling" of data, which is important since that would reduce the efficiency gains that were the purpose of this approach.
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK_SIZE 1024

void processHttp(char *request);   /* defined elsewhere */

void readRequests(int httpSocket)
{
    size_t nextHttpBytesRead = 0;
    char  *nextHttp = NULL;

    while (1)
    {
        size_t httpBytesRead = nextHttpBytesRead;
        size_t thisHttpSize;
        char  *http = nextHttp;
        char  *temp;
        char  *httpTerminator;
        do
        {
            /* Grow by one chunk; the extra byte holds the NUL that strstr() needs. */
            temp = realloc(http, httpBytesRead + CHUNK_SIZE + 1);
            if (NULL == temp)
            {
                free(http);
                return;   /* out of memory */
            }
            http = temp;
            ssize_t n = read(httpSocket, http + httpBytesRead, CHUNK_SIZE);
            if (n <= 0)
            {
                free(http);
                return;   /* connection closed or read error */
            }
            httpBytesRead += (size_t)n;
            http[httpBytesRead] = '\0';   /* keep the buffer NUL-terminated */
            httpTerminator = strstr(http, "\r\n\r\n");
        } while (NULL == httpTerminator);

        thisHttpSize = (size_t)(httpTerminator - http) + 4;   /* include terminator */
        nextHttpBytesRead = httpBytesRead - thisHttpSize;
        /* Adding CHUNK_SIZE here means that the first realloc won't have to do any work. */
        nextHttp = malloc(nextHttpBytesRead + CHUNK_SIZE + 1);
        memcpy(nextHttp, http + thisHttpSize, nextHttpBytesRead);

        http[thisHttpSize] = '\0';
        processHttp(http);
        free(http);
    }
}
The TCP data stream comes in one IP packet at a time, which can be up to 1,500 or so bytes depending on the IP layer configuration. In Linux this is going to wait in one SKB until the application layer reads it off the queue. If you read one byte at a time you suffer the overhead of context switches between the application and the kernel simply to copy one byte from one structure to another. The optimum solution is to use non-blocking IO to read the content of one SKB at a time and so minimize the switches.
If you are after optimum bandwidth you could read larger amounts of bytes at a time in order to further reduce the context switches, at the expense of latency, as more time will be spent outside the application copying memory. But this only applies to extremes, and such code changes should be implemented only when required.
If you examine the multitude of existing HTTP technologies you can find alternative approaches such as using multiple threads and blocking sockets, pushing more work back into the kernel to reduce the overhead of switching into the application and back.
I have implemented an HTTP server library very similar to torak's pseudo code here: http://code.google.com/p/openpgm/source/browse/trunk/openpgm/pgm/http.c The biggest speed improvements for this implementation came from making everything asynchronous so nothing ever blocks.
Indy, for example, takes the buffered approach. When the code asks Indy to read a line, it first checks its current buffer to see if a line break is present. If not, the network is read in chunks and appended to the buffer until the line break appears. Once it does, just the data up to the line break is removed from the buffer and returned to the app, leaving any remaining data in the buffer for the next reading operation. This makes for a good balance between a simple application-level API (ReadLine, ReadStream, etc.) and efficient network I/O (read everything that is currently available in the socket, buffer it, and ack it so the sender is not waiting too long - fewer network-level reads are needed this way).
Start by reading one byte at a time (though noting that lines end with cr/lf in HTTP) because it's simple. If that's not sufficient, do more complex things.
Read a byte array buffer at a time. Reading single characters will be dog slow because of the multiple context switches between user and kernel mode (depending on the libc actually).
If you read buffers, you need to be prepared for the buffer to either not be filled completely (watch the returned length), not contain enough bytes to reach the end of a line, or contain more than one line.
Mapping your line or fixed-size block requests onto that variable stream of buffers is a common pattern in network applications (and often implemented wrong; for example, a 0-byte-length answer is possible). Higher-level languages will hide this complexity from you.
Recently an interviewer asked me about the types of buffers. What types of buffers are there? This question actually came up when I said I would write all the system calls to a log file to monitor the system. He said that writing each and every call to a file would be slow, and asked how to prevent that. I said I would use a buffer, and he asked me what type of buffer. Can someone explain the types of buffers to me, please?
In C under UNIX (and probably other operating systems as well), there are usually two buffers, at least in your given scenario.
The first exists in the C runtime libraries where information to be written is buffered before being delivered to the OS.
The second is in the OS itself, where information is buffered until it can be physically written to the underlying media.
As an example, we wrote a logging library many moons ago that forced information to be written to the disk so that it would be there if either the program crashed or the OS crashed.
This was achieved with the sequence:
fflush (fh); fsync (fileno (fh));
The first of these actually ensured that the information was handed from the C runtime buffers to the operating system, the second that it was written to disk. Keep in mind that this is an expensive operation and should only be done if you absolutely need the information written immediately (we only did it at the SUPER_ENORMOUS_IMPORTANT logging level).
To be honest, I'm not entirely certain why your interviewer thought it would be slow unless you're writing a lot of information. The two levels of buffering already there should perform quite adequately. If it were a problem, then you could just introduce another layer yourself which wrote the messages to an in-memory buffer and then delivered them to a single fprintf-type call when it was about to overflow.
But, unless you do it without any function calls, I can't see it being much faster than what the fprintf-type buffering already gives you.
Following clarification in comments that this question is actually about buffering inside a kernel:
Basically, you want this to be as fast, efficient and workable (not prone to failure or resource shortages) as possible.
Probably the best bet would be a buffer, either statically allocated or dynamically allocated once at boot time (you want to avoid the possibility that dynamic re-allocation will fail).
Others have suggested a ring (or circular) buffer but I wouldn't go that way (technically) for the following reason: the use of a classical circular buffer means that to write out the data when it has wrapped around will take two independent writes. For example, if your buffer has:
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|s|t|r|i|n|g| |t|o| |w|r|i|t|e|.| | | | | | |T|h|i|s| |i|s| |t|h|e| |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
^ ^
| |
Buffer next --+ +-- Buffer start
then you'll have to write "This is the " followed by "string to write.".
Instead, maintain the next pointer and, if the bytes in the buffer plus the bytes to be added are less than the buffer size, just add them to the buffer with no physical write to the underlying media.
Only if you are going to overflow the buffer do you start doing tricky stuff.
You can take one of two approaches:
Either flush the buffer as it stands, set the next pointer back to the start for processing the new message; or
Add part of the message to fill up the buffer, then flush it and set the next pointer back to the start for processing the rest of the message.
I would probably opt for the second given that you're going to have to take into account messages that are too big for the buffer anyway.
What I'm talking about is something like this:
initBuffer:
    create buffer of size 10240 bytes.
    set bufferEnd to end of buffer + 1
    set bufferPointer to start of buffer
    return

addToBuffer (size, message):
    while size != 0:
        xfersz = minimum (size, bufferEnd - bufferPointer)
        copy xfersz bytes from message to bufferPointer
        message = message + xfersz
        bufferPointer = bufferPointer + xfersz
        size = size - xfersz
        if bufferPointer == bufferEnd:
            write buffer to underlying media
            set bufferPointer to start of buffer
        endif
    endwhile
That basically handles messages of any size efficiently by reducing the number of physical writes. There will be optimisations of course - it's possible that the message may have been copied into kernel space so it makes little sense to copy it to the buffer if you're going to write it anyway. You may as well write the information from the kernel copy directly to the underlying media and only transfer the last bit to the buffer (since you have to save it).
In addition, you'd probably want to flush an incomplete buffer to the underlying media if nothing had been written for a time. That would reduce the likelihood of missing information on the off chance that the kernel itself crashes.
Aside: Technically, I guess this is sort of a circular buffer but it has special case handling to minimise the number of writes, and no need for a tail pointer because of that optimisation.
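Translated into C, that pseudocode looks roughly like this (flushToMedia() stands in for "write buffer to underlying media"):

#include <stddef.h>
#include <string.h>

#define LOGBUF_SIZE 10240

static char  logBuffer[LOGBUF_SIZE];
static char *bufferEnd     = logBuffer + LOGBUF_SIZE;   /* one past the end */
static char *bufferPointer = logBuffer;

static void flushToMedia(const char *buf, size_t len)
{
    (void)buf; (void)len;      /* write to the underlying media here */
}

void addToBuffer(const char *message, size_t size)
{
    while (size != 0) {
        size_t room   = (size_t)(bufferEnd - bufferPointer);
        size_t xfersz = (size < room) ? size : room;

        memcpy(bufferPointer, message, xfersz);
        message       += xfersz;
        bufferPointer += xfersz;
        size          -= xfersz;

        if (bufferPointer == bufferEnd) {               /* buffer full: flush and restart */
            flushToMedia(logBuffer, LOGBUF_SIZE);
            bufferPointer = logBuffer;
        }
    }
}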
There are also ring buffers which have bounded space requirements and are probably best known in the Unix dmesg facility.
What comes to mind for me is time-based and size-based buffers. You could either just write whatever is in the buffer to the file once every x seconds/minutes/hours or whatever. Alternatively, you could wait until there are x log entries or x bytes worth of log data and write them all at once. This is one of the ways that log4net and log4j do it.
Overall, there are "First-In-First-Out" (FIFO) buffers, also known as queues; and there are "Latest*-In-First-Out" (LIFO) buffers, also known as stacks.
To implement FIFO, there are circular buffers, which are usually employed where a fixed-size byte array has been allocated. For example, a keyboard or serial I/O device driver might use this method. This is the usual type of buffer used when it is not possible to dynamically allocate memory (e.g., in a driver which is required for the operation of the Virtual Memory (VM) subsystem of the OS).
Where dynamic memory is available, FIFO can be implemented in many ways, particularly with linked-list derived data structures.
Also, binomial heaps implementing priority queues may be used for the FIFO buffer implementation.
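For illustration, a minimal fixed-size circular FIFO of the kind such a driver might use (names and size are made up; a real driver would make the head and tail indices volatile or atomic):

#include <stdbool.h>
#include <stddef.h>

#define FIFO_SIZE 256                  /* power of two keeps the math cheap */

static unsigned char fifo[FIFO_SIZE];
static size_t fifo_head;               /* next slot to write */
static size_t fifo_tail;               /* next slot to read  */

bool fifo_put(unsigned char c)         /* producer side, e.g. an ISR */
{
    size_t next = (fifo_head + 1) % FIFO_SIZE;
    if (next == fifo_tail)
        return false;                  /* full: one slot is deliberately kept empty */
    fifo[fifo_head] = c;
    fifo_head = next;
    return true;
}

bool fifo_get(unsigned char *c)        /* consumer side */
{
    if (fifo_tail == fifo_head)
        return false;                  /* empty */
    *c = fifo[fifo_tail];
    fifo_tail = (fifo_tail + 1) % FIFO_SIZE;
    return true;
}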
A particular case that is neither a FIFO nor a LIFO buffer is the TCP segment reassembly buffer. It may hold segments received out of order ("from the future"), pending the receipt of intermediate segments that have not yet arrived.
* My acronym is better, but most would call LIFO "Last In, First Out", not Latest.
Correct me if I'm wrong, but wouldn't using a mmap'd file for the log avoid both the overhead of small write syscalls and the possibility of data loss if the application (but not the OS) crashed? It seems like an ideal balance between performance and reliability to me.
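A rough sketch of that mmap idea, for reference (the path, size, and names are made up, and error handling is trimmed):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOG_SIZE (1 << 20)             /* 1 MiB mapping, illustrative */

static char  *logmap;
static size_t logpos;

int log_open(const char *path)         /* e.g. "/var/tmp/app.log" (made up) */
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, LOG_SIZE) < 0)
        return -1;
    logmap = mmap(NULL, LOG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                         /* the mapping stays valid after close */
    return logmap == MAP_FAILED ? -1 : 0;
}

void log_write(const char *msg, size_t len)
{
    if (logpos + len > LOG_SIZE)
        return;                        /* out of space; a real logger would rotate */
    memcpy(logmap + logpos, msg, len); /* no write() syscall per message */
    logpos += len;
    /* msync(logmap, logpos, MS_ASYNC) is optional: with MAP_SHARED the kernel
       writes back dirty pages even if the process crashes (but not if the OS does). */
}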